CRX - The Tiny Parser for Rexx

Preamble

You will have noticed that as people get older, they often try to relive the best times in their lives - they retire to the area where they grew up, they take up again hobbies of long ago, and so on. In this spirit, I have been revisiting some of the projects I have worked on.

Top of the list in terms of technology and calibre of colleagues was the "Virtual Memory Terminal" project of twenty years ago. "Terminal" in that title means a comparatively cheap screen and hardware given its powers by software loaded over a wire - today it would be called a Net Computer.

"Virtual Memory" meant that the people writing code to run in the terminal could be largely unaware of the physical configuration; the address space was large and the actual segments of code and data needed were loaded on demand from the host (nowadays known as the server). As well as the function-per-dollar benefits claimed, the issues of maintenance and control were trumpeted. At that time the "computer centers" that were responsible for the host computers were nervous about the advent of personal computers - they feared that if each person controlled their own computer's contents it would be difficult to ensure that people could work together and difficult to ensure that time was spent on business objectives rather than on the mechanics of computing. I reckon they were right to be nervous.

Apart from being in monochrome on a small screen, the applications in VMT were the familiar text processing and graphics of today. My job in the team of four was to write the compiler so that the others could program in PL/I which, in terms of aspirations, was the Java of its day. The engine to be compiled for was a Motorola 6800, which even then cost only a few dollars and was the forerunner of the 68000 series. There were no separate graphics cards in those days, so the 6800, in addition to running the applications, wrote pel-by-pel on the screen by sending bits to special addresses - hard work for a puny engine.

You probably cannot buy a system with a 6800 in it these days, but the Intel 8086 was a similar contemporary and has progressed compatibly through 386, 486 and Pentium. So I chose to do my "coding for fun" on a Pentium PC. Writing assembler for DOS comes closest to the "bare bones" coding underlying VMT. Although the 16 bit DOS doesn't have virtual memory, it does have memory addressed as a collection of variable length segments, as VMT did. Rexx has affinities with PL/I and I have worked on the American National Standard for Rexx in recent years. Thus Rexx on DOS became the vehicle for my technology experiments.

Technology that makes code small does not have a great commercial value in these days of cheap memory, although one can imagine uses for it, for example when transmitting a program and something to execute the program together. The justification for choosing to work on making code small is mostly that the problem is interesting. You will have guessed that I no longer have to justify my time to a manager!

The Power of Backus-Naur Form

Here is a snippet from the American National Standard for Rexx, describing something that can occur in a PARSE statement:

    template_list := template | [template] ',' [template_list]
      template := (trigger | target | Msg38.1) +
        target := VAR_SYMBOL | '.'

There is no international standard for how BNF should be written but if you have a rough idea of what is being described you can deduce the formalities. Here the vertical bar is separating alternatives, square brackets enclose optional items, and the plus sign denotes repetition. The "Msg" notation is unique to the BNF for Rexx. It says "if all else fails to match, produce this message". Elsewhere in the standard the message is defined:

    #ErrorText.38.1 = 'Invalid parsing template detected at "<token>"'

The angle brackets denote an insert in the message, to be replaced by an actual token from the source program being parsed.

The Rexx standard actually uses two lots of BNF, one to describe how individual characters in the source program are grouped into tokens, like the VAR_SYMBOL above, and one to describe how tokens are arranged into programs. The use of BNF for the former problem is a convenience but somewhat over-engineered since (apart from nested comments) there is no need for recursion in describing tokens. The implementation of tokenizing is without much interest so I will concentrate on the BNF for the arrangement of tokens in programs.

The key question is "How far can the purely descriptive BNF be processed into a sequence of instructions for scanning programs?". The first step to automation is to note that elegance in the BNF is nice for the human reader but a handicap to a utility program that aims to process it. So a sensible opening move is to expand away some of the shorthand BNF constructs. The snippet above becomes:

    template_list=template
    template_list=template ',' template_list
    template_list=template ','
    template_list=',' template_list
    template_list=','
    template=Q90
    Q90=Q56
    Q90=Q90 Q56
    Q56=trigger
    Q56=target
    Q56=Msg38.1
    target=VAR_SYMBOL
    target='.'

The new numbers in this are not significant; they are just used in generating unique names.

The utility programs like the one that does this expansion do not need to be efficient. There are not many subjects for them to process, unlike a compiler of source programs. Mine are written in 'C'.

I will be showing several bits of print-out from the utility programs as a way of explaining what those programs do; I should emphasise that the person who is making a parser does not need to make use of these print-outs. The parser maker simply supplies a BNF and makes use of the tables eventually produced by the utility programs.

If we consider one of the lines above in the context of left-to-right scanning of a sequence of tokens, then the scanning might have reached various stages in recognizing it. Here the possible positions are marked with a caret character:

    template_list=^ template ',' template_list
    template_list=template ^ ',' template_list
    template_list=template ',' ^ template_list
    template_list=template ',' template_list ^

If we consider all such positions over all the BNF we get a rough idea of all the states that the scanning of an arbitrary source program can get into. Actually that is not the best definition of "all the states", because two states do not have to be regarded as different if what can validly follow them in the subject program is the same for both. There is a long-established theory about determining states, which can be found in the textbooks on parsing. A utility program applying that theory shows that there are over 400 states for Rexx parsing. So the parsing problem can be restated as "what do we do with each sort of token when encountered in each of the states?". There are more than 100 sorts of tokens because keywords are tokens and most keywords have unique effects.

There are only three things that make much sense on encountering a particular token in a particular context:

  1. To declare the subject program incorrect, that is, it does not match the BNF.
  2. To decide the token corresponds to an advance in recognizing something (that is, "moving the caret" in the examples above). This is called a "Shift" in the literature.
  3. To decide the token is beyond the end of something to be recognized, like the template_list above. This is called a "Reduction" in the literature.

In the first case parsing is over; in the others there is a new state. Here is how I have chosen to print out just a couple of the possible states:

  State number 100
  template_list=',' ^template_list
  template_list=',' ^
  (';' Msg21.1 )
    '('=>98 '+'=>99 ','=>100 '-'=>101 '.'=>102 '='=>103 NUMBER=>104
    STRING=>105 VAR_SYMBOL=>106 Msg38.1=>0 Q56=>107 Q90=>108
    absolute_positional=>109 pattern=>110 positional=>111
    relative_positional=>112 target=>113 template=>114 template_list=>236
    trigger=>116 vrefp=>117
  State number 102
  target='.' ^

State 100 occurs after a comma is encountered in a template list. The arrow signs point at the number of a new state. At the left of the arrow there is either a token, showing the shift for that token, or something which may be recognized. In the latter case the corresponding new state is entered when that thing is recognized.

The new state to enter when something is recognized may differ according to the context in which it was recognized so often information on the context has to be stacked so that the right choice can be made subsequently. The theory of parsing offers choices about exactly what to stack and how to use what is stacked.

The Aoe Method with a Novel Packing

I like the parser method developed by Junichi Aoe, along with others. The details can be found in the references below. Broadly, the algorithm is:

    Cycle:
        If the state is an error state, produce the message.
        If the state has any possibility of shifting, go to Shifts.
    Reduction:
        Remove from the stack an amount determined by this state.
        Work out from this state and the top-of-stack state what the new state will be.
        Go to Cycle.
    Shifts:
        Consider the token. Is it one this state can shift on?
        If not, go to Reduction.
        Stack the state.
        Work out from the current state and the token what the new state is.
        Note that a new token will be needed for the next cycle.
        Go to Cycle.
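
For illustration, here is the same cycle rendered as C rather than CRX's hand-coded assembler. The accessor functions, the starting state and the accepting state are hypothetical stand-ins for lookups in the packed syntax table described below; this is a sketch of the method, not the real implementation.

    /* A minimal sketch of the parsing cycle; the extern functions are
       hypothetical stand-ins for lookups in the packed syntax table. */
    #define STACK_MAX    256
    #define ACCEPT_STATE (-1)                      /* hypothetical */

    extern int  is_error_state(int state);
    extern int  has_shifts(int state);
    extern int  can_shift(int state, int token);   /* the bit-array test */
    extern int  pop_count(int state);              /* amount to unstack */
    extern int  goto_state(int state, int top);    /* reduction successor */
    extern int  shift_state(int state, int token); /* shift successor */
    extern int  next_token(void);                  /* the tokenizer */
    extern void report_error(int state);

    int parse(void)
    {
        int stack[STACK_MAX];
        int sp = 0;
        int state = 0;                     /* assumed starting state */
        int token = next_token();

        for (;;) {                         /* Cycle */
            if (state == ACCEPT_STATE)
                return 1;                  /* subject program matches */
            if (is_error_state(state)) {
                report_error(state);       /* parsing is over */
                return 0;
            }
            if (has_shifts(state) && can_shift(state, token)) {
                stack[sp++] = state;       /* Shift: stack the state */
                state = shift_state(state, token);
                token = next_token();      /* new token for next cycle */
            } else {
                sp -= pop_count(state);    /* Reduction: unstack */
                state = goto_state(state, stack[sp - 1]);
            }
        }
    }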

So with the BNF processed into states, and a parsing method chosen, the problem turns less to theory and more to implementation. How should the information be encoded so that the tests are made efficiently?

The test of acceptability of a token in the current state is a yes/no test, best handled by a bit array. If the information about the token includes an index for one dimension of this array, and the state information provides the other index, then this is a one-instruction test on a Pentium. (The instruction set for the Pentium has an instruction for picking one bit out of a 32 bit field, and there is a general method of using 32 bit instructions in 16 bit mode by adding a prefix byte to the instruction.)
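
In C the test might look like the sketch below. The row index stands for the state, the bit index for the token, and the array sizes are illustrative rather than taken from the real tables.

    /* A sketch of the shiftability test: one bit per (state, token)
       pair.  The Pentium BT instruction does the indexing and the bit
       extraction in one instruction; in C it is a shift and a mask. */
    #define NTOKEN_SORTS 128                   /* illustrative */
    #define ROW_WORDS    ((NTOKEN_SORTS + 31) / 32)

    /* 32 bit words (long on the 16 bit compilers of the day) */
    extern unsigned long shift_bits[][ROW_WORDS];  /* one row per state */

    int can_shift(int state_row, int token_index)
    {
        return (shift_bits[state_row][token_index >> 5]
                >> (token_index & 31)) & 1;
    }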

The deduction of the new state from the current state and the latest token can also be an array but there is no reason why this array of state numbers has to be compact. If it is a sparse array the gaps between elements can be used to hold information about the states. (The information needed about a state varies from two to twelve contiguous bytes.)

The address at which the information about a state is stored serves as an identifier for the state, replacing the arbitrary numbering produced by the utilities processing the BNF. This packing process, which produces a solid chunk of "syntax table" from the fragments relating to individual states, has a high degree of freedom in deciding the order of the fragments. It can also, if necessary, do a lot of trial-and-error to find a good layout. The challenging part is to find a packing approach which allows most of the reductions, which are derivations of the new state from the current state and top-of-stack, to be done by arithmetic. The arithmetic may be new = current + stack on the addresses of the states, or may be like IF stack > constant THEN new = some_state.
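
As a sketch, with invented addresses and constants, the two arithmetic forms of reduction might be rendered in C like this:

    /* Sketches of the two reduction forms, working on state addresses.
       The packer must have laid the fragments out so that these
       calculations land on the right records; all numbers invented. */
    int reduce_direct(int state_addr, int stacked_addr)
    {
        return state_addr + stacked_addr;      /* new = current + stack */
    }

    int reduce_by_test(int stacked_addr)
    {
        /* IF stack > constant THEN new = some_state */
        return (stacked_addr > 472) ? 574 : 554;
    }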

The packing algorithm which I developed can, no doubt, be improved on, but it is adequate to reduce the syntax table size to below four bytes per state. The output of the packing utility is a set of assembler code declarations. These get included into the hand-coded part of the assembler code for parsing. The hand-coded part has the cycle above and has declarations that map the fields in the utility-generated inclusion onto bits in words. There follows some of the packing utility output and one of the hand-coded declarations. It is not intended to be self-explanatory, but just to give an idea of how generated and hand-coded parts are combined by the assembler.

    ShiftRec{1,0,0,0,1,0,1,2};    S13@470
    ErrorRec{1,1,35,1}
    ShiftRec{1,0,0,1,1,0,17,1};    S85@472
    word Keys85-R;375
    ErrorRec{1,1,25,16}
    RedRec{0,0,1,$DirectR,196}; S40
    word Action58-R
    RedRec{0,1,0,$ReferR,765}; S127    S120@477
    RedRec{0,0,0,$EqTest,574}; S196
    RedRec{0,0,0,$DirectR,554}; S128

    ErrorRec record HasShiftOn:1, ErrorAloneOn:1, MajorField:8, MinorField:6

The numbers after 'S' are the states as numbered for BNF processing; the numbers after '@' are the locations of the state information shown (and hence also the identification of the state during actual parsing).

The outcome of all this, for the BNF in the standard, is a syntax table of just over 1700 bytes used by an algorithm implemented in 450 bytes. To these figures it is necessary to add some 600 bytes for the table of keywords and 200 bytes for doing lookup of keywords. How much further gets added for converting from characters to tokens is a matter of how tight the assembler code is, but I have not got below 2400 bytes - there is a lot of work in checking the various sorts of quoted literals, the different ways of writing comparison operators, and so on.

Messages

The definers of the standard were able to be creative in the area of messages, since better messages pose no threat that existing valid programs might run in a different way under a Standard implementation. I think it is generally agreed there was good cause for adding message refinements - the Standard has 42 messages for different errors that in some non-Standard implementations would all be seen as Error 40. However, this gives rise to a cost in size (as well as in the detailed error detections that have to be implemented). The raw prose of the messages, as in the example of message 38.1 above, totals more than 20,000 bytes. So without compression, the messages would be several times the size of the rest of parsing.

The algorithm used by PKZIP-style compression is a good one - it achieves good compression, is quick at both compression and decompression, and is fairly simple to implement. Applied to the message text it gives 4250 bytes. That is pretty good, but not the best for this purpose. For this application we want easy decompression, but we do not need quick compression or decompression, nor easy compression. Also there is a small improvement that can come from knowing that the keywords in the message text are also keywords in the parser tables - their usage in the messages can be replaced by references to those tables.

With this set of objectives it is reasonable to consider "byte-encoding" schemes. These start from the premise that the text, being letters, digits and a few punctuation characters, will not use all 256 possible byte values. So some unused value in the range 0-255 can be used to represent a fragment of text, for example "must be ", and replacing all occurrences of that fragment with the previously unused byte value will result in a shorter text (even after allowing for the need to keep one copy of the fragment as the meaning of the new byte value). The shorter text presents a new problem which may again be amenable to shortening by the same approach.
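
The decompressor for such a scheme can be very small. Here is a sketch in C, assuming each spare byte value maps to a stored fragment which may itself contain spare byte values; the names and the boundary value are hypothetical.

    /* A sketch of decompression for the byte-encoding scheme.  Byte
       values below FIRST_CODE are ordinary characters; the rest name
       fragments, expanded recursively because a fragment may itself
       contain fragment codes. */
    #include <stdio.h>

    #define FIRST_CODE 0x80                    /* illustrative boundary */

    extern const unsigned char *fragment[];    /* fragment[c - FIRST_CODE] */

    void expand(const unsigned char *s, FILE *out)
    {
        for (; *s != '\0'; s++) {
            if (*s < FIRST_CODE)
                fputc(*s, out);
            else
                expand(fragment[*s - FIRST_CODE], out);
        }
    }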

After experimenting with various schemes for automatically choosing the fragments, and for reversing early choices that prove inferior to others, I believe the limit of this approach is just below 3700 bytes.

To this figure we must add the space needed for implementing the decompression algorithm and for replacing inserts in the message. There are many sorts of inserts - argnumber, bif, char, description, hex, keyword list, linenumber, name, operator, options, position, token and value. So 900 bytes of implementation to make use of the 3700 bytes of compressed messages.

Why Bother?

There is some practical value to establishing that parsing according to the Rexx standard is not a monster problem, and can be done in 10K bytes. The smallness is in general not an economic advantage - I compute that if it were twice the size then the extra capital cost of storage on a hard disk would be about a tenth of a cent. But if you like a technical challenge, I can recommend the search for smallness, which comes with frequent instant gratification as you make changes and monitor the next assembly.

BNF and Code Generation

Most parsers are not written just to answer the question "is the syntax of this program OK" but also as part of compiling or interpreting the program. Automated methods based on BNF are a good start to code generation as well as to syntax error detection.

As described above, parsing recognizes constructs, like template_list, occurring in a subject program. That recognition is the natural time to generate code to handle the construct at a subsequent execution time. There is a process called "decorating the BNF" that says "when this construct is recognized, run that bit of the implementation". Here is an example from my decorated version of the BNF from the Standard.

    prefix_expression := realprefix.11 | term | Msg35.1
      realprefix.11 := ('+' | '-' | '\') prefix_expression
    iterate := iterate.28 | iterate.29
      iterate.28 := 'ITERATE'
      iterate.29 := 'ITERATE' (VAR_SYMBOL | Msg20.2)

The non-Msg names with a number at the end mean "When the parser recognizes one of these it should invoke ActionNN, where NN is the number". The utility programs that process BNF make sure this happens by putting the Action label in the syntax tables. In this example Action11 will make code for a unary operation, Action28 will check that the ITERATE is within a repetitive DO, and Action29 will work out which DO, if any, is being iterated. The same stack that is used for parsing can be used to hold the information about containing DO loops.
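
In C, the dispatch could be no more than a switch on the action number delivered with the reduction. This is a sketch, with hypothetical names for the work just described:

    /* A sketch of dispatching decoration actions at reduction time. */
    extern void emit_unary_op(void);           /* hypothetical */
    extern void check_iterate_in_do(void);     /* hypothetical */
    extern void resolve_iterated_do(void);     /* hypothetical */

    void run_action(int action)
    {
        switch (action) {
        case 11: emit_unary_op();        break;  /* realprefix.11 */
        case 28: check_iterate_in_do();  break;  /* iterate.28 */
        case 29: resolve_iterated_do();  break;  /* iterate.29 */
        default: break;                  /* rule with no decoration */
        }
    }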

The BNF can be written, and the one in the Standard is, so that the order of recognizing things is the same as the order of execution when the program runs. So for ALPHA+BETA*GAMMA the multiplication construct is recognized before the addition construct. This is also the order of the operators when compiling to postfix Polish notation.
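
Compiled that way, ALPHA+BETA*GAMMA yields the postfix sequence:

    ALPHA BETA GAMMA * +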

Overall, the extra implementation to go from parser to compiler is satisfyingly small. What Rexx source programs can best be compiled to is a whole different subject.

References

  1. J. Aoe, Y. Yamamoto and R. Shimada, 'Practical method for optimizing LR(k) parser using matrix structure', Systems Computers Controls, 10, (4), (1979)
  2. B. Marks, 'Taming the PL/I syntax', Software Practice & Experience, 14, (8), (1984)