123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345 |
- <html> <head>
- <title>Byte-compilation of Scheme using Java byte-codes</title>
- </head>
- <body>
- <h1>Byte-compilation of Scheme using Java byte-codes</h1>
- <address>
- <a href="http://www.cygnus.com/~bothner">Per Bothner</a>
- <a href="mailto:bothner@cygnus.com"> bothner@cygnus.com</a>
- </address>
- May 1996
- <p>
- <em>(This design document discusses the issues of implementing
- Java in Scheme. It predates the implementation, and does not
- necessarily match the current Kawa implementation.)</em>
- <p>
- Our biggest outstanding deliverable for Guile is a byte-compiler
- and -interpreter. Using Java has been suggested, and there is
- wide-spread interest in using Java byte codes for other languages.
- For example, <a href=http://www.inmet.com"">InterMetrics</a> is
- porting their <a HREF="http://www.inmet.com/~stt/adajava_paper/">
- Ada95 compiler to emit Java byte-codes</a>.
- See <a href="http://www.cs.tu-berlin.de/~tolk/vmlanguages.html">here</a>
- for more languages that use Java.
- <p>
- There is already an implementation
- of Scheme on top of Java. This is "Kawa", written by
- <a href="mailto:sgml@winternet.com"> R. Alexander Milowski</a>.
- There are some major problems with Kawa 0.1, and it needs a lot
- of re-writing, but it may be a useful starting point.
- <p>
- There are some different possible targets for Scheme-in-Java:
- <ul>
- <li>A Scheme byte-code compiler with support libraries that will run on
- any Java implementation. The support libraries will implement
- the Scheme primitives. For maximum portability, it should be
- written in Java, but it is also possible to use "native" methods
- (i.e. written in C).
- <li>The same, but we target a specific Java implementation that has
- some extensions to support Scheme better.
- <li>A byte-code compiler that targets a Java interpreter that is
- implemented on top of a Scheme system. For example, we could target
- the (very incomplete) "Latte" sub-system of Guile.
- <li>A hybrid system: Depending on compiler switches, it could emit either
- portable bytecodes or Scheme-enhanced bytecodes. The latter might
- be either more efficient or provide features not available in the
- portable Scheme-in-Java implementation.
- </ul>
- <p>
- Related design issues include:
- <ul>
- <li>Execution efficiency.
- <li>Class file size and loading speed.
- <li>Need for general <a href="#Tail-calls">tail-call-elimination</a>.
- If general tail-call-elimination is required, then it is not possible
- to produce portable bytecodes with a direct translation, though it is
- possible to use Henry Baker's trick (a function returns its continuation
- to its caller), which is also used in RScheme.
- <li>Need for full or restricted <a href="#Continuations">continuations</a>.
- It is not possible to implement general <code>call/cc</code> using
- direct translation to portable bytecodes. It is possible if you re-write to
- continuation-passing style.
- Restricted call/cc that only returns to a caller is possible.
- </ul>
- <p>
- <h2>Data types</h2>
- One issue how to represent Scheme values using Java types.
- Because of Scheme's dynamic typing, we represent Scheme values
- using Java object references. Scheme objects inherit from
- the <code>Object</code> class, like all Java objects.
- One might add an intermediate class <code>Scheme_Object</code>,
- and have all Scheme types be derived from it. That is possible, but I
- don't think it is a good idea. I think it is better to re-use existing
- Java classes. For example the Scheme values <code>#f</code> and
- <code>#t</code> should be the Java objects <code>Boolean.FALSE</code>
- and <code>Boolean.True</code>. This does imply
- that one cannot use that standard Java print functions (since they
- would print in Java syntax, rather than Scheme syntax).
- <p>
- Traditionally, Scheme (and Lisp) implementations represent objects
- using a short union, with special bit "tags" to distinguish heap
- objects and immediate values (such as fixnums). This allows
- some operations (including fixnum arithmetic) to be more efficient.
- Unfortunately, there is no way to do this with Java bytecodes.
- <a href="mailto:shivers@ai.mit.edu">Olin Shivers</a> proposes a Java
- <a href="http://www.ai.mit.edu/people/shivers/javaScheme.html">
- extension to allow tagged types</a>.
- I don't think this has much chance of being standardized,
- and I think pre-allocation of common values (characters and
- small ints) can give some of the benefits without changing the VM.
- <p>
- Here are my suggestions for representing the various Scheme types:
- <dl>
- <dt><code>#f</code> and <code>#t</code>
- <dd>Use the standard Java class <code>Boolean</code>.
- <dt>characters <dd> Use standard Java class <code>Character</code>.
- Two minor problems:
- <ul>
- <li>Java only supports 16-bit Unicode characters. It is probably possible
- to make a Java system that can support more characters, while still
- supporting standard Java bytecode files.
- <li>Character values would not compare <code>eq?</code>, unless we provide a
- system-wide "obarray" table for characters. Luckily, Scheme
- does not require characters that are <code>equal?</code>
- to be <code>eq?</code>.
- </ul>
- <dt>numbers <dd> Use the various standard Java classes.
- It may be reasonable to statically allocate say the integers -1 ... 100.
- Alternatively, one could use a global (weak-?) hash-table to ensure
- <code>equal?</code> integers compare <code>eq?</code>.
- <dt> pairs <dd> Add a new <code>Pair</code> class.
- <dt> () <dd> We could use the Java NIL value, or some dummy object.
- <dt> symbols <dd> Use the Java <code>String</code> class.
- Use <code>String.intern()</code> to make symbols unique.
- <dt> strings <dd> Use the standard Java <code>StringBuffer</code> class.
- <dt> vectors <dd> Use a Java array of <code>Object</code>.
- <dt> procedures/closures <dd> We have to use special objects.
- This merits further discussion (below).
- </dl>
- <h2>Procedures</h2>
- Most Scheme functions are fairly straight-forward to compile to
- Java byte-codes. However, there are some complications:
- <ul>
- <li>Scheme functions are first class <a href="#function-values">values</a>,
- that can be passed around as objects.
- <li>Nested functions may require <a href="#closures">closures</a>
- to be created.
- <li><a href="#Tail-calls">Tail-calls</a> are required to not cause stack
- growth.
- <li>First-class <a href="#Continuations">continuations</a> may
- require the stack state to copied.
- </ul>
- <p>
- Java does not have <a name="function-values">function values</a>.
- They can be simulated with a new
- <code>Procedure</code> class with a (virtual) <code>apply</code> method.
- Each function gets implemented as a separate sub-class of
- <code>Procedure</code>, and the Scheme code of the function is
- compiled into an <code>apply</code> method for that
- <code>Procedure</code> sub-class.
- <p>
- In some cases, a number of functions can use the same <code>Procedure</code>
- sub-class. For example, the <code>c[ad]*r</code> functions could be
- implemented with a single <code>Procedure</code> sub-class, with
- a field containing an encoding of the <code>[ad]*</code>.
- The <code>apply</code> method uses that field to do the right thing.
- <p>
- A <a name="closures">closure</a> requires a <code>Procedure</code>
- sub-class with a field that contains the static context.
- The static context can be an array,
- though it can also be an <code>Object</code> of some suitable class.
- Evaluating the nested lambda expression will allocate a context object,
- and then "new" the <code>Procedure</code> sub-class, passing it the
- context object. The <code>apply</code> method well get non-local
- bindings using the context object.
- <h2><A NAME="Tail-calls">Tail call elimination<a/></h2>
- The most important use of tail-calls is probably tail-recursion, where
- a procedure calls itself (by name). This is easily compiled using
- a Java GOTO byte-operation. (This is more difficult if translating
- to Java source code, since Java does not have a goto statement, though
- Java byte-codes do.)
- <p>
- Mutual tail-calls among multiple functions compiled together can be
- handled by compiling all the functions in one big procedure, and
- and starting it with a switch operation to jump to the code for the
- desired function. In that case, a tail-call to one of the other
- functions can easily be compiled as a GOTO. It is not clear whether
- this is worthwhile, given that it still does not get you general
- tail-call elimination.
- <p>
- General tail-call elimination can be implemented using Baker's
- trick (a function returns the address of the next function to call),
- but this constrains you calling convention strongly, and for various
- reasons is messy and undesirable in a Java context.
- <p>
- The ideal way to support general tail-call elimination is to just
- do it in the Java interpreter (either for all tail-calls or using
- a new operation). The more optimized and tuned the interpreter
- is, the more likely is that that this will be tricky and
- machine-dependent, since it may involve messy adjustments
- to the C stack. It has been proposed that gcc be enhanced to
- support a new (extra) calling convention such that the <em>called</em>
- function be responsible for popping the arguments of the stack (rather
- than the caller). C programs could specify an <code>__attribute__</code> to
- indicate that a function should use the new calling convention (thus
- regular functions can call functions using the new convention, and
- vice versa). If the Java interpreter is carefully written to use
- the new calling convention where appropriate, then it should be fairly
- easy to implement general tail-call elimination. If Scheme code is
- compiled to native (C or assembler), it can also use the new calling
- convention, thus getting full tail-call elimination. The FSF
- (i.e. Stallman) is favorable to this approach, which would be
- quite a bit of work, but would be generally useful to anyone who
- needs general tail-call elimination.
- <h2><a name="Continuations">Continuations</a></h2>
- Implementating continuations (specifically the <code>call/cc</code>
- function) may be straight-forward in some environments. However, it
- cannot be implemented in Java, nor can it be implemented in portable
- C code.
- <p>
- One option is to not worry about fully implementation continuations.
- Most uses for continuations are to implement exception handling,
- threads, and back-tracking. Exception handling and threads are
- already provided by Java. Continuations used to implement
- (non-resuming) exception handling are a special case of continuations
- that only return to stack frames in the current call chain; these
- can be implemented using a combination of <code>Continuation</code> objects and
- exception handling. Guile already supports threads (not implemented
- using continuations), as does Java; using continuations to switch
- threads in such an environment is probably a bad idea. We do not
- have a ready interface for back-tracking, except by using continuations,
- but I don't think many Scheme applications use back-tracking.
- <p>
- If our Java-byte-code interpreter is embedded in a Scheme system that
- already supports continuations, then it is trivial to implement
- <code>call/cc</code>.
- <p>
- An alternative is to have the compiler transform a Scheme program
- to a form which does not need explicit continuations. One option
- is to translate to continuation-passing-style. I don't understand
- this will enough to comment on the implications.
- <p>
- <a name="Stalin" href="http://tochna.technion.ac.il/~qobi"> Jeffrey Mark Siskind</a>
- is working on a highly optimizing compiler to translate Scheme to C.
- It could possibly be re-targeted to emit Java byte-codes. However, it is
- written for for a "closed-world assumption" with global analysis of the whole
- program. There is as yet no support for separate compilation of independent
- modules, though that is supposedly planned. In any case, if highly-optimized
- Scheme code is important, there are probably ideas (and code)
- we can use in Stalin.
- Also, <a href="http://www.ccs.neu.edu/home/will"> William Clinger</a>
- has expressed plans to re-target his
- <a href="http://www.ccs.neu.edu/home/will/Twobit"> Twobit</a>
- optimizing compiler to
- emit Java byte-codes, and he believes it should be easy to do.
- <h2>File format</h2>
- A Java .class file contains more than just code.
- It also includes symbol-table information, constants, and other meta-data.
- We need to consider how this works for Scheme.
- One point to note is that a Java .class file contains the compilation
- of only a single class. I believe it is possible to compile a
- Scheme source file (or module) and only write a single .class file.
- However, that would restrict your implementation choices a bit,
- and otherwise warp the resulting structure, so it may be
- prefereable to generate multiple classes. This will require the
- use of multiple "helper" .class files, which is clumsy. It may
- be reasonable to extend the .class format to allow nested classes.
- <h3>Constants</h3>
- Scheme code includes constants of two kinds: Simple literals,
- and compound quoted S-expressions.
- Simple literals occurr in
- other languages (including Java), and are handled similarly:
- <ul>
- <li>Standard constants (such as <code>#f</code> are compiled into
- references to static members of standard classes
- (such as <code>Boolean.FALSE</code>).
- <li>Quoted Symbols can be compiled the same way the Java compiler
- handles string literals.
- <li>Strings and vectors be be compiled into references to static fields
- that get initialized at class-loading-time.
- <li>Character and integer literals can also be handled using
- references to static fields initialized at load time.
- Unboxed numbers (if used) can be compiled in
- the same way the Java compiler handles number literals.
- An extended Java interpreter can use other methods.
- </ul>
- Quoted S-expressions can be evaluated at class-loading-time,
- and a reference to the result stored in a class-static field.
- <h3>Symbol table information</h3>
- Loading a compiled Scheme program requires that definitions
- get elaborated, and inserted into the Scheme global symbol table.
- This can be done by the class's static initializer method,
- which gets evaluated at class-loading time. No special
- magic is needed in the .class file to declare the various
- functions and variables. However, if one has a system that
- does separate compilation with a module system, it may be
- desirable to extract type and symbol information from a
- previously-compiled module. This requires at least some conventions
- for how to access symbol information.
- <h2>Eval</h2>
- Implementing <code>eval</code>, <code>load</code>, and a
- <code>read</code>-<code>eval</code>-<code>print</code> loop
- can be done in various ways.
- <p>
- If the bytecode interpreter is inside a Scheme interpreter, then
- the latter presumably already has <code>eval</code>. We can use that.
- We would need gateways between the different interpreters, so
- different kinds of functions can call each other. To do that,
- the calling convention for Java methods should be based on the
- already existing Scheme calling convention. Even if we do
- a Scheme-in-Java system from scratch, it is not very difficult
- to write a simple interpreter for <code>eval</code>.
- <p>
- Alternatively, we can use bytecode for interpreted code too:
- Code that needs to be evaluated is compiled to bytecodes,
- and then immediately loaded and executed. Java provides a
- methods to load a class from a bytearray in memory, so it is
- not necessary to write out a temporary file. One problem is that
- Sun's current Java release does not garbage collect classes,
- so code that results from <code>eval</code> is never freed.
- This should seldom be a problem,
- except possibly in a long-running interactive session,
- since well-written Scheme code seldom uses <code>eval</code>.
- The problem can be reduced if we only byte-compile function
- definitions, but immediately evaluate other top-level expressions.
- And of course one can use a Java interpreter that <em>does</em>
- garbage collect classes.
- <p>
- The bytecode-based
- <a href="http://www.tkg.com/people/donovan/proj/rs/rscheme.html">RScheme</a>
- system implements <code>eval</code> by compiling and immediately executing
- bytecodes. It has support for multiple bytecode interpreters, and has
- immediate ("raw" or "unboxed") numbers, so it would probably be a strong
- candidate for a combined Scheme/Java system.
- <p>
- Finally, it does not take much code to write a simple Scheme interpreter,
- This can be byte compiled, and linked in
- with the rest of the system. This makes most sense in
- a Scheme system that emphasises large applications, and uses
- a compiler that does global optimization,
- such as <a href="#Stalin">Stalin</a>.
- </body></html>
|