In yet another ‘probably-useless-but-interesting’ hobby project, I wrote a Forth
compiler and interpreter targeting WebAssembly.
It’s written entirely in WebAssembly, and comes with a compiler
that emits WebAssembly code on the fly. The entire system (including 80% of all
core words) fits into a 10k (5k gzipped) WebAssembly module. You can try it out here, or
grab the code from GitHub.
What follows are some notes on the design, and some initial crude speed benchmarks.
Forth is a
low-level, minimalistic stack-based programming language. It typically comes in the
form of an interactive interpreter, where you can type in your commands. For example,
taking the sum of 2 numbers and printing the result:
2 4 + . 6 ok
Forth environments also have a compiler built in, allowing you to define new ‘words’
by typing their definition straight into the interpreter:
: QUADRUPLE 4 * ;
which you can then immediately invoke:
2 QUADRUPLE . 8 ok
Not unlike Lisps, you can customize Forth’s compiler, add new control flow
constructs, and even switch back and forth between the interpreter and the compiler.
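For example, a word definition can temporarily drop back into the interpreter using the standard words [ and ], computing a value at compile time and baking it in with LITERAL:

```forth
: SECONDS/DAY [ 60 60 * 24 * ] LITERAL ;
SECONDS/DAY . 86400 ok
```

Here, the multiplication happens once, while the word is being compiled, rather than every time SECONDS/DAY runs.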
Because of its minimalism, Forth environments can be easily ported to new
instruction sets, making them popular in embedded systems. To learn a bit more
about this language (and about WebAssembly), I wanted to try creating an
implementation for WebAssembly – not exactly an embedded instruction set, but
an instruction set nonetheless.
WAForth is (almost) entirely written in WebAssembly. The only parts that fall
outside of WebAssembly are the dynamic loader (since WebAssembly modules can’t load
new modules from within WebAssembly), and the I/O primitives to read and write a character.
I got a lot of inspiration from jonesforth, a
minimal x86 assembly Forth system, written in the form of a tutorial.
The Macro Assembler
Update (11/2019): WAForth no longer uses a custom macro assembler; the core is now written
entirely in raw WebAssembly.
The WAForth core is written as a single module in WebAssembly’s text format. The
text format isn’t really meant for writing code in, so it offers none of the facilities of a real assembler
(e.g. constant definitions, macro expansion, …). However, since the text format uses S-expressions,
a few small tweaks make it loadable in a Lisp-style system, which can then extend
it with macros.
So, I added some Scheme (Racket) macros to the module definition,
and implemented a mini-assembler that prints the resulting S-expressions in compliant WebAssembly text format.
The result is something that is almost exactly like a standard WebAssembly
text format module, but sprinkled with some macros for convenience.
The interpreter runs a loop that processes commands, and switches to and from
compile mode as word definitions are entered.
Unlike some other Forth systems, WAForth doesn’t use direct threading
for executing code, where generated code is interleaved with data, and the
program jumps between these pieces of code. WebAssembly doesn’t allow
unstructured jumps, let alone dynamic jumps. Instead, WAForth uses
subroutine threading, where each word is
implemented as a single WebAssembly function, and the system uses calls and
indirect calls (see below) to execute words.
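As an illustrative sketch (not WAForth’s literal output), a compiled word ends up as a single WebAssembly function, with calls to other words going through the shared function table; the stack manipulation and the table index below are made up for illustration:

```wasm
;; Hypothetical compiled form of QUADRUPLE from earlier
(func $QUADRUPLE
  ;; ... push 4 onto the in-memory parameter stack ...
  ;; ... then execute the word * through the shared function table
  (call_indirect (type $word) (i32.const 42)) ;; 42: illustrative index of *
)
```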
While in compile mode for a word, the compiler generates WebAssembly instructions in
binary format (as there is no assembler infrastructure in the browser). Because WebAssembly
doesn’t support JIT compilation yet, each finished word is bundled into a separate binary WebAssembly module and
sent to the loader, which dynamically loads it and registers it in a shared
function table at the
next free offset; that offset is then recorded in the word dictionary.
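The loader side can be sketched in a few lines of JavaScript. This is a minimal stand-in, not WAForth’s actual loader: the module bytes, the export name main, and the returned index are all illustrative.

```javascript
// Shared function table through which all words are invoked.
const table = new WebAssembly.Table({ initial: 0, element: "anyfunc" });

// A hand-assembled module exporting one function that returns 42,
// standing in for a word freshly compiled by the system.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f,                   // type: () -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one function of that type
  0x07, 0x08, 0x01, 0x04, 0x6d, 0x61, 0x69, 0x6e, 0x00, 0x00, // export "main"
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, 0x2a, 0x0b              // body: i32.const 42
]);

function load(bytes) {
  // Synchronously compile and instantiate the single-word module...
  const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
  // ...and register its function at the next free offset in the table.
  const index = table.grow(1); // grow() returns the previous size
  table.set(index, instance.exports.main);
  return index;                // this offset goes into the word dictionary
}

const idx = load(bytes);
console.log(table.get(idx)()); // → 42
```

From then on, any other word can execute this one with a call_indirect on that table index.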
Because words reside in different modules, all calls to and from the words need to happen as
call_indirect calls through the shared function table. This of course introduces some call overhead.
As WebAssembly doesn’t support unstructured jumps, control flow words (IF,
LOOP, REPEAT, …) can’t be implemented in terms of more basic words, unlike in jonesforth.
However, since Forth only requires structured jumps, the compiler can easily be implemented
using the loop and branch instructions available in WebAssembly.
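For example (a sketch, not the compiler’s exact output), a Forth BEGIN … UNTIL loop maps naturally onto a WebAssembly loop with a conditional branch; $flag stands in for the value the loop body leaves on the parameter stack:

```wasm
(loop $begin
  ;; ... compiled body of the loop, producing a flag ...
  ;; UNTIL: branch back to the start as long as the flag is false
  (br_if $begin (i32.eqz (local.get $flag)))
)
```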
Finally, the compiler adds minimal debug information about the compiled word in
the name section, making it easier to do some debugging in the browser.
The shell is a simple class
that loads the WebAssembly code in the browser, and
provides the loader and the I/O primitives to the WebAssembly module for reading and writing characters to a terminal. On the other end, it provides a
run() function to execute a fragment of Forth code.
To tie everything together into an interactive system, there’s a small
console-based interface around this shell to type Forth code, which you can see
in action here.
Although I didn’t focus on performance while writing WAForth, I still wanted to have an
estimate of how fast it went. To get a crude idea, I ran an implementation of the
Sieve of Eratosthenes. I let the algorithm compute
all the primes up to 90 million on my MacBook Pro 2017 (3.5 GHz Intel Core i7) running Firefox 60.0.1, and timed different systems:
- WAForth: The sieve algorithm, written in Forth, running in WAForth. The words are compiled as separate WebAssembly modules, and all calls to and from the words are indirect (as described above).
- WAForth Direct Calls: The sieve algorithm, as compiled by WAForth, but inserted directly into the WAForth core, with all indirect call instructions from the previous version replaced by direct calls. This measurement gives an indication of the overhead of indirect jumps.
- Gforth: The sieve algorithm running in Gforth Fast 0.7.3, a native, high-performance Forth.
- Vanilla WebAssembly: A straight WebAssembly implementation of the algorithm. This serves as an upper bound of how fast the algorithm can run on WebAssembly.
- JS-Forth: The sieve algorithm running in JS-Forth, a Forth system implemented in JavaScript.
- Not surprisingly, JS-Forth comes out the slowest. It also runs into memory problems when the 90 million limit is increased. The other implementations have no such problem (requiring only 1 byte per candidate prime).
- The indirect calls in WAForth cause roughly 50% overhead. Some of this overhead can be reduced once
WebAssembly starts supporting mutable globals, as
this will require fewer indirect calls for operations like loops and jumps.
- WAForth is 2× slower than the high-performance native Gforth.
- The Vanilla WebAssembly version of the Sieve algorithm is much faster than the rest. Unlike the WAForth version, this version
doesn’t need to go to memory for every piece of control flow, which causes massive speed gains. This
is especially noticeable when increasing the number of primes: for all primes less than 500 million,
the vanilla WebAssembly version is up to 8 times faster than the WAForth one.
WAForth is still experimental, and lacks support for a few of the ANS
core words (although adding support for these shouldn’t be too much work). I also didn’t spend any time profiling performance to see if there are any
low-hanging fruit optimizations to be done (although I think most optimizations would complicate
the compiler too much at this point).