|Column Tag:||Threaded Code: Mach 1
A Multi-tasking Forth System
By Jörg Langowsk, EMBL, C/O I.L.L., Grenoble, Cedex, France, MacTutor Editorial Board
"Subroutine threaded Forth - the Mach 1 system"
This time I am going to introduce to you a new Forth-83 system that I recently got to review. This system is implemented in a somewhat different way compared with MacForth, NEON or other Forths, so let me digress a little and make some comments about Forth implementations on 68000 based systems.
As most of this column's readers know, 'threaded code' means program code that is organized in a very hierarchical manner; a 'main program' or colon definition consists of a list of words that describe other lower-level definitions that in turn are lists of even lower level definitions and so on ... until one arrives at the definitions of the very lowest level that are directly executable machine code (See Figure 1 below).
How do Forth systems implement this structure? The definition of every Forth word, colon definition, variable, etc. contains one short piece of executable machine code at the very beginning. This is called the code field and determines in which way the information following it is handled. For example, in a colon definition in MacForth, the code field contains a TRAP #F (V2.0) or TRAP #E (V2.4) instruction that jumps to code that in turn:
-pushes the instruction pointer (A3) on the return stack,
-sets the instruction pointer to the address of the first word of the new definition to be executed,
-jumps to the NEXT routine which executes the tokens that follow the code field.
Mac Forth is an example of 'token threaded Forth' which means that the numbers following the code field are not valid addresses of other Forth words, but 'tokens' that are to be converted into addresses by some algorithm. In Mac Forth, positive token values are interpreted as offsets from a base pointer (A4), while negative token values are negative offsets from the top of application space (A5) into a token table, where the 32-bit address of the Forth word is found. The token table is used to handle programs that exceed the 32 K address space that can be handled through the first method of addressing.
The advantage of MacForth's and similar structures is that we need only 16 bits for each word compiled into a definition. The disadvantage is the limited address space (32K) for the simple A4-indexed addressing mode and the need for a token table to convert tokens into addresses that lie outside the 32K addressing region. Time is lost on each NEXT loop by having to test whether the token is negative or positive, then eventually by accessing the token table ( See Figure 2 ).
The Forth implementation that was used to create NEON does not need a token table since each word uses 32 bits, but more space is used in the compiled definitions that way.
Fig. 3 gives a summary of some possible addressing modes for Forth interpreters with their execution times (in system clocks, from the MC68000 manual). Some of these examples have been taken from a Dr. Dobb's Journal article by Joe Barnhart, 'Forth and the Motorola 68000', DDJ 83 (1983) 18-26.
MacForth does not conform to any of the 'pure' definitions here, but (cf. Fig. 2) is a mixture of a base-offset direct word and base-offset indirect long threaded code (with an additional branch required).
You see that in all the examples given the inner interpreter requires considerable overhead to 'set up' the forth words for execution.
Additional overhead is created by the execution of the small piece of machine code starting at the CFA. In MacForth this is a TRAP instruction (38 cycles) plus three lines of assembly code that the trap vector points to (32 cycles). Counting everything up, execution of one colon-defined Forth word requires 100 or 120 cycles under MacForth, depending whether or not the token table has to be accessed. In a system like NEON, the CFA does not contain a trap, but a JSR to a routine that sets up the inner interpreter, with a similar amount of time involved.
One solution around this problem is to code time-critical things in assembly language, so that the whole definition is made up of executable 68000 code, starting with the code field. I have given an example for this in MT V1#9 by defining macro words that compile inline code into a definition. The speedup was of the order of 50% for MacForth. Of course, if you write inline macros, you are not really creating threaded code, but linear code, and the size of the definition tends to get rather large.
Still, there is another way to create threaded, i.e. hierarchical, code that does away with the needs for an 'inner interpreter' loop completely, since the threaded code will be directly executable 68000 machine language. This sounds strange, but is actually very simple to achieve. The keyword is 'subroutine threading'. As you saw from the inline code example, we can do without a 'code field' by just starting to execute our definition at the very beginnning, provided it is all bona fide 68000 code. Now, all we have to do is to persuade the Forth system not to compile a list of addresses into a colon definition, but a list of JSR (addr). This way, we don't need the inner interpreter at all, Forth instructions are 68000 instructions; the subroutine return stack, in this case, is the normal 68000 stack, pointed to by A7 (Mach 1 uses still a different stack for loop indices, more about that later).
The speed increase over the previous examples is considerable, with only 18 cycles of overhead for the JSR instruction if the jump is taken within the segment, or 30 cycles for one more .JMP if a jump table has to be used to branch to a different segment. This is one-fifth to one-third of the overhead of any of the other implementations given on the next page in Fig. 3.
Actually, subroutine threading gives us a Forth compiler that creates native 68000 code, which won't even need a runtime Forth nucleus, but will nicely execute by itself.
Drawbacks of this strategy: A7 cannot be the Forth data stack pointer any longer, some other register has to be used. This does not really create a problem as long as one stays within Forth; the 68000 couldn't care less which stack pointer you use for passing arguments between routines. Toolbox routines, however, expect their arguments on the A7 stack. Their calling sequence, therefore, becomes a little bit more complicated - and more time consuming - in a subroutine-threaded Forth implementation, since all the arguments have to be transferred from the data stack to the A7 stack first, then any result has to be swapped back from the A7 stack to the data stack. Also, since toolbox calling should be transparent to the user, the Forth compiler should know about the parameters that all the toolbox traps expect and create the 'glue' code automatically.
Subroutine-threading, considering all the pros and cons, is certainly one of the more elegant ways to implement Forth. Given a good compiler, the user will see no difference to a 'classical' Forth, except that the resulting code runs faster by a factor of (approximately) two. The 68000, with its ability to use any address register as a stack pointer, is especially suited for this kind of approach.
Such were my thoughts at the end of last year, and I had started doing some experiments implementing my own Forth on the Mac, which was going to be subroutine threaded. That was when I received a letter which described a system that did all the things that I just mentioned, and some more. With the effect that I realised that perhaps in a year or two my own efforts would get me to where others had gone already. Too bad.
The letter came from a company called Palo Alto Shipping, and they had developed a system called Mach1, a subroutine threaded Forth-83, multi-tasking above all. After a long introduction, let us jump right in the middle of things with a description what Mach1 is and what it can do.
The Mach1 system
With my evaluation package I got one disk and a deceptively small handbook. Opening the latter revealed a phrase on page IV that I wish to see on more software products: All rights reserved. Nothing more, no legalese for which you need to take courses to understand, just a simple copyright notice as you would find it in any serious textbook. Being a strong adherent of Jerry Pournelle's policy concerning Silly Licensing Agreements, I found this part of the manual extremely sympathetic. Later, it is explicitly said that you may make as many backup copies as you like and use them - one at a time only - on any machine you like. Good.
The handbook gives a short introduction to Forth and refers beginners to Starting Forth for a good tutorial. The system is Forth-83, with some additions. Of course, these 'additions' are where the fun comes in, so here are the main features that make Mach1 comfortable:
-program loading from normal text files, no blocks.
-local variables in a local stack frame created by the LINK/UNLK instruction.
-a floating point package for access to the SANE library. Unfortunately, no fast 32-bit floating point (only Fortran has it so far).
-words that give easy access to the MacinTalk speech synthesizer, which comes with the system.
-vectored input/output (e.g., if you really want, you could use the system from a serial terminal).
-an assembler that recognizes standard Motorola syntax 68000 code,
-a debugger/disassembler/decompiler that can be used like Macsbug and recognizes almost the same commands, but at the same time decompiles Forth words.
-the possibility to save the system with all definitions that you added so that you can quit and resume work without having to reload everything. This is called 'snapshot' in MacForth and 'workspace' in Mach1.
-a very easy way to create stand-alone 'turnkey' applications which contain a stripped-down version of the Forth runtime system and are not too large. These applications may be sold without any additional licensing fees.
-true multi-tasking with an arbitrary number of tasks.
The toolbox is accessed through the word CALL which (on compile) looks up the necessary parameter setup for the trap, swaps them from the data stack (A6) to the subroutine stack (A7) and calls the trap. Three traps cannot be called, UnlodeSeg and LodeSeg - this is understandable because of the heavy use of the segment loader by Mach itself - and TECalText, which I wasn't able to figure out why.
For the creation and management of windows, controls, and menus, sufficient support is given through high-level words in the Mach1 system. There is a 'templates' menu with which you can define a window's characteristics in a dialog box, and the Forth commands to create that window will be automatically copied to the clipboard; same for menus and controls. Like in NEON, you may load from the clipboard.
Mach1 is packaged with the switcher and MDS Edit. You're supposed to set up the system in a way that you can switch between Edit and Mach1 for development. I found myself pretty soon trashing Edit and using the Editor desk accessory for writing programs, which overall is faster, though Edit does a better job of formatting.
Multi-tasking under Mach1
Before we get to the program example, let's take a quick look at the multi-tasking features of Mach1. You will soon see that multi-tasking is so essential in this system that the style of programming in Mach1 will be quite different from other Forths.
Under Mach1, you may define tasks. Each task has some private stack and user variable space assigned. Tasks are arranged in an endless chain, with each task containing a pointer to the next task in the chain at the base of its user variable area. Just below this pointer is a word that contains the status of the task, SLEEP or WAKE. This word actually is executable 68000 code; if the task is aSLEEP, the code looks like
4EF9 XXXX XXXX JMP next task,
if it is aWAKE, it is
4E40 ---- ----TRAP #0 .
So tasks that are dormant will just be skipped by a jump to the base address of the next task; active ones execute the trap instruction, which falls into a routine that restores the task's register file and continues where it quit the last time. Any task will execute until the word PAUSE is encountered, at which point the local registers will be saved and execution transferred to the next task.
PAUSE can be compiled into a task's code at strategic points, for example in a loop that is part of a lengthy calculation. Also some built-in words (mainly I/O, like EMIT) contain a PAUSE. The requirement for the task to call the scheduler, instead of the scheduler switching tasks automatically, is the only thing that distinguishes Mach1 from a full implementation of multi-tasking. So multi-user would be not practical, because if one user entered into a pauseless loop, the others would be cut off; but for single-user multi-tasking this system works very well.
I have written a background task that acts as a printer spooler for this month's example. One of the problems with Mach1 is multiple file handling; each task can only handle one open file at a time (from how I understood the manual). Therefore, multi-tasking comes in automatically if one wants to work with several open files (I've heard that this is done sometimes...).
Another drawback is that the file handling routines cannot accept a file name string as an external parameter; it is compiled into the definition that calls the file routine. Therefore, when a new spool file is opened, my program resorts to a kludge, storing a unique 4 digit decimal number into the appropriate position in the word's code (which I was able to find with the debugger).
The example below creates a task, spool_task, that executes spool in the background. This is an infinite loop that looks for the numbers of spool files contained in a list; if the list is non-empty (list.busy), it will open the spool file with that number and print it until it encounters a zero byte (end-of-file for a text file). The spool file is not deleted, so your disk might fill up rather quickly; reason is that there is no predefined word for deletion of files in Mach1, and I was too lazy to write one.
If you have Mach1, load this program, create a second active window, like the "Skylight" window in the example on page vi of the manual. Then say new_spool in the main window, which will create a new spool file, open it and redirect the output to both the file and the window. Click the second window and do the same there. Now you can create output in one of the windows, e.g. by ?FREE or WORDS, let the thing go and do some output in the other window. When you close_spool in any window, the background task will automatically start printing the spool file while you can continue doing (almost) whatever you like in one of the windows.
The famous Erastothenes Sieve (see back issues of this magazine, e.g. V1#9) executes in 12.8 sec (in my hands). This is almost twice as fast as MacForth or NEON (both slightly above 20 sec) and only twice as slow as compiled C of average quality.
The Assembler contained in Mach1 accepts standard Motorola syntax. Words defined under Mach1 may be used in the operand field of assembly statements, e.g. in a JSR. One important feature is the automatic creation of inline code; when you say MACH after defining a word, the compiler will insert the code comprising the word into other definitions, instead of compiling a JSR to the code's address. Simple words like DUP and SWAP are defined this way; when you disassemble a definition containing these words, you will find no JSRs, but code that does DUP and SWAP directly.
This review, by no means complete, should have given you a flavor of Mach1 and what can be done with it. I am impressed of the speed of the code created; an easy-to-use assembler and debugger and the MACH inline code feature help speeding up the system even more. For speed considerations, a fast 32-bit floating point package would be desirable.
The multi-tasking feature is so easy to use that one often splits up programs into tasks when one wouldn't have thought about doing so in other languages. Due to the intrinsically high speed (subroutine threading) the scheduler also doesn't create too much overhead, if one uses PAUSE with consideration.
The vocabularies that come with the system are rather small compared with other systems; nevertheless, full Forth-83 has been implemented and all the necessary things are there. With the MACH macro feature, many of the 'combined' words such as IC! are not needed anymore. The existing set of words is explained very thoroughly in the manual, going into depths of the assembly coding where necessary. Dictionary structures, memory maps, task structures all are explained in detail.
Access to the Macintosh toolbox routines is convenient, and the template facility to set up standard windows and controls comes in very handy. There is even a special kind of scroll bar that will automatically resize with the window.
Stand-alone application may be created very easily, spanning several code segments if necessary. The applications need no run time kernel or library to run, which is nice (in fact, I consider it almost essential). The right to sell Mach1-created applications is contained in the purchasing price of $49.95.
Still, there are some bugs in the system. The debugger will work only once with the debugging switch, using the switch a second time crashed the system (in my hands). Also, some instructions are not disassembled correctly. Those bugs (and probably others that I haven't found) are nothing more and nothing less than one would expect for a freshly released system and I expect them to be fixed soon (upgrade charge $5).
The file handling system is not as convenient as it could be. Multi-tasking makes up for many of the inconveniencies, but not for all of them. A reasonable set of open/close/position/delete routines that accept stack input should be included. Also, the TECaltext trap should be accessible through the normal CALL mechanism. Finally, a fast floating point package would make a real nice addition.
All in all, I can highly recommend Mach1 for both Forth beginners and developers. What intrigues me most is that with a rather small vocabulary one has a system at hand that makes near optimal use of the Macintosh's special features and is fast, too. One last recommendation to Palo Alto Shipping: 20 instead of 2 pages of index would improve the handbook dramatically.
( Mach 1 background printer spooler )
( © J. Langowski / MacTutor Feb. 1986 )
only mac also i/o also forth
: wkg_spool working" spool.####" ;
: cre_spool create" spool.####" ;
: use_spool using" spool.####" ;
variable printlist 400 vallot
190 user active.spool
: file# filecount @ ;
: print# print.ptr @ 4 - @ ;
: incr.file file# 1+ filecount ! ;
: incr.print print.ptr @ 4 + print.ptr ! ;
: decr.print print.ptr @ 4 - print.ptr ! ;
: reset.file 1 filecount ! ;
: reset.print printlist print.ptr ! ;
: add.spool ( n - )
printlist print.ptr @
do i 4 - @ i ! -4 +loop incr.print
printlist ! ;
: list.busy print.ptr @ printlist <> ;
11 constant ##offset
( offset from beginning of file routine to file name )
: wkg_spool# ( n - ) <# # # # # #> ( four digits )
['] wkg_spool ##offset + swap cmove
( change file name, append number )
: cre_spool# ( n - ) <# # # # # #>
['] cre_spool ##offset + swap cmove cre_spool ;
: use_spool# ( n - ) <# # # # # #>
['] use_spool ##offset + swap cmove use_spool ;
file# cre_spool# file# wkg_spool#
file# active.spool !
." Your spool file is spool." file# . cr
file output 12 emit ( form feed )
." SPOOL FILE # " file# . cr
console file + output
: close_spool console output close-file
active.spool @ add.spool ;
: print_file 0
begin dup virtual c@ dup while emit 1+ repeat
print# decr.print use_spool# disk 4 + w@
not if print_file close-file
else 10 call sysbeep then
400 1000 background spool_task
: spool activate
1cc0a mode2 comm2 output
begin print_spool pause again