scm2java.html 16 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345
  1. <html> <head>
  2. <title>Byte-compilation of Scheme using Java byte-codes</title>
  3. </head>
  4. <body>
  5. <h1>Byte-compilation of Scheme using Java byte-codes</h1>
  6. <address>
  7. <a href="http://www.cygnus.com/~bothner">Per Bothner</a>
  8. <a href="mailto:bothner@cygnus.com"> bothner@cygnus.com</a>
  9. </address>
  10. May 1996
  11. <p>
  12. <em>(This design document discusses the issues of implementing
  13. Java in Scheme. It predates the implementation, and does not
  14. necessarily match the current Kawa implementation.)</em>
  15. <p>
  16. Our biggest outstanding deliverable for Guile is a byte-compiler
  17. and -interpreter. Using Java has been suggested, and there is
  18. wide-spread interest in using Java byte codes for other languages.
  19. For example, <a href=http://www.inmet.com"">InterMetrics</a> is
  20. porting their <a HREF="http://www.inmet.com/~stt/adajava_paper/">
  21. Ada95 compiler to emit Java byte-codes</a>.
  22. See <a href="http://www.cs.tu-berlin.de/~tolk/vmlanguages.html">here</a>
  23. for more languages that use Java.
  24. <p>
  25. There is already an implementation
  26. of Scheme on top of Java. This is "Kawa", written by
  27. <a href="mailto:sgml@winternet.com"> R. Alexander Milowski</a>.
  28. There are some major problems with Kawa 0.1, and it needs a lot
  29. of re-writing, but it may be a useful starting point.
  30. <p>
  31. There are some different possible targets for Scheme-in-Java:
  32. <ul>
  33. <li>A Scheme byte-code compiler with support libraries that will run on
  34. any Java implementation. The support libraries will implement
  35. the Scheme primitives. For maximum portability, it should be
  36. written in Java, but it is also possible to use "native" methods
  37. (i.e. written in C).
  38. <li>The same, but we target a specific Java implementation that has
  39. some extensions to support Scheme better.
  40. <li>A byte-code compiler that targets a Java interpreter that is
  41. implemented on top of a Scheme system. For example, we could target
  42. the (very incomplete) "Latte" sub-system of Guile.
  43. <li>A hybrid system: Depending on compiler switches, it could emit either
  44. portable bytecodes or Scheme-enhanced bytecodes. The latter might
  45. be either more efficient or provide features not available in the
  46. portable Scheme-in-Java implementation.
  47. </ul>
  48. <p>
  49. Related design issues include:
  50. <ul>
  51. <li>Execution efficiency.
  52. <li>Class file size and loading speed.
  53. <li>Need for general <a href="#Tail-calls">tail-call-elimination</a>.
  54. If general tail-call-elimination is required, then it is not possible
  55. to produce portable bytecodes with a direct translation, though it is
  56. possible to use Henry Baker's trick (a function returns its continuation
  57. to its caller), which is also used in RScheme.
  58. <li>Need for full or restricted <a href="#Continuations">continuations</a>.
  59. It is not possible to implement general <code>call/cc</code> using
  60. direct translation to portable bytecodes. It is possible if you re-write to
  61. continuation-passing style.
  62. Restricted call/cc that only returns to a caller is possible.
  63. </ul>
  64. <p>
  65. <h2>Data types</h2>
  66. One issue how to represent Scheme values using Java types.
  67. Because of Scheme's dynamic typing, we represent Scheme values
  68. using Java object references. Scheme objects inherit from
  69. the <code>Object</code> class, like all Java objects.
  70. One might add an intermediate class <code>Scheme_Object</code>,
  71. and have all Scheme types be derived from it. That is possible, but I
  72. don't think it is a good idea. I think it is better to re-use existing
  73. Java classes. For example the Scheme values <code>#f</code> and
  74. <code>#t</code> should be the Java objects <code>Boolean.FALSE</code>
  75. and <code>Boolean.True</code>. This does imply
  76. that one cannot use that standard Java print functions (since they
  77. would print in Java syntax, rather than Scheme syntax).
  78. <p>
  79. Traditionally, Scheme (and Lisp) implementations represent objects
  80. using a short union, with special bit "tags" to distinguish heap
  81. objects and immediate values (such as fixnums). This allows
  82. some operations (including fixnum arithmetic) to be more efficient.
  83. Unfortunately, there is no way to do this with Java bytecodes.
  84. <a href="mailto:shivers@ai.mit.edu">Olin Shivers</a> proposes a Java
  85. <a href="http://www.ai.mit.edu/people/shivers/javaScheme.html">
  86. extension to allow tagged types</a>.
  87. I don't think this has much chance of being standardized,
  88. and I think pre-allocation of common values (characters and
  89. small ints) can give some of the benefits without changing the VM.
  90. <p>
  91. Here are my suggestions for representing the various Scheme types:
  92. <dl>
  93. <dt><code>#f</code> and <code>#t</code>
  94. <dd>Use the standard Java class <code>Boolean</code>.
  95. <dt>characters <dd> Use standard Java class <code>Character</code>.
  96. Two minor problems:
  97. <ul>
  98. <li>Java only supports 16-bit Unicode characters. It is probably possible
  99. to make a Java system that can support more characters, while still
  100. supporting standard Java bytecode files.
  101. <li>Character values would not compare <code>eq?</code>, unless we provide a
  102. system-wide "obarray" table for characters. Luckily, Scheme
  103. does not require characters that are <code>equal?</code>
  104. to be <code>eq?</code>.
  105. </ul>
  106. <dt>numbers <dd> Use the various standard Java classes.
  107. It may be reasonable to statically allocate say the integers -1 ... 100.
  108. Alternatively, one could use a global (weak-?) hash-table to ensure
  109. <code>equal?</code> integers compare <code>eq?</code>.
  110. <dt> pairs <dd> Add a new <code>Pair</code> class.
  111. <dt> () <dd> We could use the Java NIL value, or some dummy object.
  112. <dt> symbols <dd> Use the Java <code>String</code> class.
  113. Use <code>String.intern()</code> to make symbols unique.
  114. <dt> strings <dd> Use the standard Java <code>StringBuffer</code> class.
  115. <dt> vectors <dd> Use a Java array of <code>Object</code>.
  116. <dt> procedures/closures <dd> We have to use special objects.
  117. This merits further discussion (below).
  118. </dl>
  119. <h2>Procedures</h2>
  120. Most Scheme functions are fairly straight-forward to compile to
  121. Java byte-codes. However, there are some complications:
  122. <ul>
  123. <li>Scheme functions are first class <a href="#function-values">values</a>,
  124. that can be passed around as objects.
  125. <li>Nested functions may require <a href="#closures">closures</a>
  126. to be created.
  127. <li><a href="#Tail-calls">Tail-calls</a> are required to not cause stack
  128. growth.
  129. <li>First-class <a href="#Continuations">continuations</a> may
  130. require the stack state to copied.
  131. </ul>
  132. <p>
  133. Java does not have <a name="function-values">function values</a>.
  134. They can be simulated with a new
  135. <code>Procedure</code> class with a (virtual) <code>apply</code> method.
  136. Each function gets implemented as a separate sub-class of
  137. <code>Procedure</code>, and the Scheme code of the function is
  138. compiled into an <code>apply</code> method for that
  139. <code>Procedure</code> sub-class.
  140. <p>
  141. In some cases, a number of functions can use the same <code>Procedure</code>
  142. sub-class. For example, the <code>c[ad]*r</code> functions could be
  143. implemented with a single <code>Procedure</code> sub-class, with
  144. a field containing an encoding of the <code>[ad]*</code>.
  145. The <code>apply</code> method uses that field to do the right thing.
  146. <p>
  147. A <a name="closures">closure</a> requires a <code>Procedure</code>
  148. sub-class with a field that contains the static context.
  149. The static context can be an array,
  150. though it can also be an <code>Object</code> of some suitable class.
  151. Evaluating the nested lambda expression will allocate a context object,
  152. and then "new" the <code>Procedure</code> sub-class, passing it the
  153. context object. The <code>apply</code> method well get non-local
  154. bindings using the context object.
  155. <h2><A NAME="Tail-calls">Tail call elimination<a/></h2>
  156. The most important use of tail-calls is probably tail-recursion, where
  157. a procedure calls itself (by name). This is easily compiled using
  158. a Java GOTO byte-operation. (This is more difficult if translating
  159. to Java source code, since Java does not have a goto statement, though
  160. Java byte-codes do.)
  161. <p>
  162. Mutual tail-calls among multiple functions compiled together can be
  163. handled by compiling all the functions in one big procedure, and
  164. and starting it with a switch operation to jump to the code for the
  165. desired function. In that case, a tail-call to one of the other
  166. functions can easily be compiled as a GOTO. It is not clear whether
  167. this is worthwhile, given that it still does not get you general
  168. tail-call elimination.
  169. <p>
  170. General tail-call elimination can be implemented using Baker's
  171. trick (a function returns the address of the next function to call),
  172. but this constrains you calling convention strongly, and for various
  173. reasons is messy and undesirable in a Java context.
  174. <p>
  175. The ideal way to support general tail-call elimination is to just
  176. do it in the Java interpreter (either for all tail-calls or using
  177. a new operation). The more optimized and tuned the interpreter
  178. is, the more likely is that that this will be tricky and
  179. machine-dependent, since it may involve messy adjustments
  180. to the C stack. It has been proposed that gcc be enhanced to
  181. support a new (extra) calling convention such that the <em>called</em>
  182. function be responsible for popping the arguments of the stack (rather
  183. than the caller). C programs could specify an <code>__attribute__</code> to
  184. indicate that a function should use the new calling convention (thus
  185. regular functions can call functions using the new convention, and
  186. vice versa). If the Java interpreter is carefully written to use
  187. the new calling convention where appropriate, then it should be fairly
  188. easy to implement general tail-call elimination. If Scheme code is
  189. compiled to native (C or assembler), it can also use the new calling
  190. convention, thus getting full tail-call elimination. The FSF
  191. (i.e. Stallman) is favorable to this approach, which would be
  192. quite a bit of work, but would be generally useful to anyone who
  193. needs general tail-call elimination.
  194. <h2><a name="Continuations">Continuations</a></h2>
  195. Implementating continuations (specifically the <code>call/cc</code>
  196. function) may be straight-forward in some environments. However, it
  197. cannot be implemented in Java, nor can it be implemented in portable
  198. C code.
  199. <p>
  200. One option is to not worry about fully implementation continuations.
  201. Most uses for continuations are to implement exception handling,
  202. threads, and back-tracking. Exception handling and threads are
  203. already provided by Java. Continuations used to implement
  204. (non-resuming) exception handling are a special case of continuations
  205. that only return to stack frames in the current call chain; these
  206. can be implemented using a combination of <code>Continuation</code> objects and
  207. exception handling. Guile already supports threads (not implemented
  208. using continuations), as does Java; using continuations to switch
  209. threads in such an environment is probably a bad idea. We do not
  210. have a ready interface for back-tracking, except by using continuations,
  211. but I don't think many Scheme applications use back-tracking.
  212. <p>
  213. If our Java-byte-code interpreter is embedded in a Scheme system that
  214. already supports continuations, then it is trivial to implement
  215. <code>call/cc</code>.
  216. <p>
  217. An alternative is to have the compiler transform a Scheme program
  218. to a form which does not need explicit continuations. One option
  219. is to translate to continuation-passing-style. I don't understand
  220. this will enough to comment on the implications.
  221. <p>
  222. <a name="Stalin" href="http://tochna.technion.ac.il/~qobi"> Jeffrey Mark Siskind</a>
  223. is working on a highly optimizing compiler to translate Scheme to C.
  224. It could possibly be re-targeted to emit Java byte-codes. However, it is
  225. written for for a "closed-world assumption" with global analysis of the whole
  226. program. There is as yet no support for separate compilation of independent
  227. modules, though that is supposedly planned. In any case, if highly-optimized
  228. Scheme code is important, there are probably ideas (and code)
  229. we can use in Stalin.
  230. Also, <a href="http://www.ccs.neu.edu/home/will"> William Clinger</a>
  231. has expressed plans to re-target his
  232. <a href="http://www.ccs.neu.edu/home/will/Twobit"> Twobit</a>
  233. optimizing compiler to
  234. emit Java byte-codes, and he believes it should be easy to do.
  235. <h2>File format</h2>
  236. A Java .class file contains more than just code.
  237. It also includes symbol-table information, constants, and other meta-data.
  238. We need to consider how this works for Scheme.
  239. One point to note is that a Java .class file contains the compilation
  240. of only a single class. I believe it is possible to compile a
  241. Scheme source file (or module) and only write a single .class file.
  242. However, that would restrict your implementation choices a bit,
  243. and otherwise warp the resulting structure, so it may be
  244. prefereable to generate multiple classes. This will require the
  245. use of multiple "helper" .class files, which is clumsy. It may
  246. be reasonable to extend the .class format to allow nested classes.
  247. <h3>Constants</h3>
  248. Scheme code includes constants of two kinds: Simple literals,
  249. and compound quoted S-expressions.
  250. Simple literals occurr in
  251. other languages (including Java), and are handled similarly:
  252. <ul>
  253. <li>Standard constants (such as <code>#f</code> are compiled into
  254. references to static members of standard classes
  255. (such as <code>Boolean.FALSE</code>).
  256. <li>Quoted Symbols can be compiled the same way the Java compiler
  257. handles string literals.
  258. <li>Strings and vectors be be compiled into references to static fields
  259. that get initialized at class-loading-time.
  260. <li>Character and integer literals can also be handled using
  261. references to static fields initialized at load time.
  262. Unboxed numbers (if used) can be compiled in
  263. the same way the Java compiler handles number literals.
  264. An extended Java interpreter can use other methods.
  265. </ul>
  266. Quoted S-expressions can be evaluated at class-loading-time,
  267. and a reference to the result stored in a class-static field.
  268. <h3>Symbol table information</h3>
  269. Loading a compiled Scheme program requires that definitions
  270. get elaborated, and inserted into the Scheme global symbol table.
  271. This can be done by the class's static initializer method,
  272. which gets evaluated at class-loading time. No special
  273. magic is needed in the .class file to declare the various
  274. functions and variables. However, if one has a system that
  275. does separate compilation with a module system, it may be
  276. desirable to extract type and symbol information from a
  277. previously-compiled module. This requires at least some conventions
  278. for how to access symbol information.
  279. <h2>Eval</h2>
  280. Implementing <code>eval</code>, <code>load</code>, and a
  281. <code>read</code>-<code>eval</code>-<code>print</code> loop
  282. can be done in various ways.
  283. <p>
  284. If the bytecode interpreter is inside a Scheme interpreter, then
  285. the latter presumably already has <code>eval</code>. We can use that.
  286. We would need gateways between the different interpreters, so
  287. different kinds of functions can call each other. To do that,
  288. the calling convention for Java methods should be based on the
  289. already existing Scheme calling convention. Even if we do
  290. a Scheme-in-Java system from scratch, it is not very difficult
  291. to write a simple interpreter for <code>eval</code>.
  292. <p>
  293. Alternatively, we can use bytecode for interpreted code too:
  294. Code that needs to be evaluated is compiled to bytecodes,
  295. and then immediately loaded and executed. Java provides a
  296. methods to load a class from a bytearray in memory, so it is
  297. not necessary to write out a temporary file. One problem is that
  298. Sun's current Java release does not garbage collect classes,
  299. so code that results from <code>eval</code> is never freed.
  300. This should seldom be a problem,
  301. except possibly in a long-running interactive session,
  302. since well-written Scheme code seldom uses <code>eval</code>.
  303. The problem can be reduced if we only byte-compile function
  304. definitions, but immediately evaluate other top-level expressions.
  305. And of course one can use a Java interpreter that <em>does</em>
  306. garbage collect classes.
  307. <p>
  308. The bytecode-based
  309. <a href="http://www.tkg.com/people/donovan/proj/rs/rscheme.html">RScheme</a>
  310. system implements <code>eval</code> by compiling and immediately executing
  311. bytecodes. It has support for multiple bytecode interpreters, and has
  312. immediate ("raw" or "unboxed") numbers, so it would probably be a strong
  313. candidate for a combined Scheme/Java system.
  314. <p>
  315. Finally, it does not take much code to write a simple Scheme interpreter,
  316. This can be byte compiled, and linked in
  317. with the rest of the system. This makes most sense in
  318. a Scheme system that emphasises large applications, and uses
  319. a compiler that does global optimization,
  320. such as <a href="#Stalin">Stalin</a>.
  321. </body></html>