BASICO compiler source code reading - Parser 2

Author: Andre Adrian
Version: 09.jul.2006

func gettype()
func assign()
func pop()
func callid()
func call()
func exprList()
func callarg()
func expr()
func rel()
func addexpr()
func term()
func divmod()
func unaryfact()
func fact()
func push()
func pushvar()
func pusharrayvar()

The functions of this section produce most of the assembler code. The condition, arithmetric and logic operations in expr(), addexpr(), term(), unaryfact() and fact() produce "the working" opcodes like add and mul and a lot of transport opcodes (mov).

func gettype()

294    func gettype(offset: array int, name: array char): int
295    var id: int
296    begin
297    id = lookup(locals, name)
298    if id >= 0 then
299        offset[0] = localoffset[id]
300        return localtype[id]
301    endif
302    id = lookup(globals, name)
303    if -1 = id then
304        error("variable not found")
305    endif
306    return globaltype[id]
307    end

Function gettype() has two output values. The variable type is returned through the return value. The variable offset is returned through the call-by-reference array variable offset. The first lookup() for local variables is in line 297. The second lookup() for global variables is in line 302. This implements the expected local variables are used instead of global variables in case of identical name.
The variable type is coded into a single-character-representation. The coding is:

global integer	global integer array	local integer	local integer array	call-by-reference integer array
i	I	j	J	K

global char	global char array	local char	local char array	call-by-reference char array
c	C	d	D	E

func assign()

434    func assign(): void
435    begin
436    if 'i' = typevar then
437        printf("\tmovl\t%%e%cx, %s\n", reg[0], idvar)
438    else if 'j' = typevar then
439        printf("\tmovl\t%%e%cx, %d(%%ebp)\n", reg[0], offsetvar[0])
440    else if 'I' = typevar then
441        printf("\tmovl\t%%e%cx, %s(,%%e%cx,4)\n", reg[0], idvar, reg[1])
442        pop()
443    else if 'J' = typevar then
444        printf("\tmovl\t%%e%cx, %d(%%ebp,%%e%cx,4)\n", reg[0], offsetvar[0], reg[1])
445        pop()
446    else if 'K' = typevar then
447        printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
448        printf("\tmovl\t%%e%cx, (%%esi,%%e%cx,4)\n", reg[0], reg[1])
449        pop()
450    else if 'c' = typevar then
451        printf("\tmovb\t%%%cl, %s\n", reg[0], idvar)
452    else if 'd' = typevar then
453        printf("\tmovb\t%%%cl, %d(%%ebp)\n", reg[0], offsetvar[0])
454    else if 'C' = typevar then
455        printf("\tmovb\t%%%cl, %s(%%e%cx)\n", reg[0], idvar, reg[1])
456        pop()
457    else if 'D' = typevar then
458        printf("\tmovb\t%%%cl, %d(%%e%cx,%%ebp)\n", reg[0], offsetvar[0], reg[1])
459        pop()
460    else if 'E' = typevar then
461        printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
462        printf("\tmovb\t%%%cl, (%%esi,%%e%cx)\n", reg[0], reg[1])
463        pop()
464    endif; endif; endif; endif; endif; endif; endif; endif; endif; endif
465    pop()
466    end

Function assign() selects the correct transport opcode depending on variable type. The 80386 transport opcodes are quite powerful. The base_of_array + size_of_array_element * array_index calculation is done as part of the store mov opcode. Only in case of call-by-reference assignment, we need two opcodes and the CPU register %esi to done the indirect store.

func pop()

413    func pop(): void
414    var i: int
415    tmp: char
416    begin
417    se = se - 1
418    if se = 0 then
419        cacheinit()
420    else
421        i = 0
422        tmp = reg[i]
423        while i < 4 - 1 do
424          reg[i] = reg[i + 1]
425          i = i + 1
426        wend
427        reg[i] = tmp
428    endif
429    if se >= 4 then
430        printf("\tpopl\t%%e%cx\n", reg[i])
431    endif
432    end

The function pop() is one part of the stack-cache implementation, the other part is function push(). The real CPU stack is used with the popl opcode in line 430 if more then 4 stack elements (variable se) are needed. If pop() is called, the old top-of-stack element is dropped and the fourth-of-stack element is loaded from real stack to stack cache.

func callid()

359    func callid(): void
360    var i: int
361    begin
362    i = se
363    while i >= 1 do
364        i = i - 1
365        printf("\tpushl\t%%e%cx\n", reg[i])
366    wend
367    oldse = se
368    se = 0
369    strcpy(idcall, token)
370    args = 0
371    end

In the BASICO statement a = b + c(x, y) the stack cache is used for two purposes: First to code the assignment and the expression (the statement fragment a = b + ). Second to handle the function arguments of c(x, y). Function callid() pushes the stack cache to the "real" stack at the moment the parser sees c(, a function. The stack elements se is stored in variable oldse, se is set to 0 to init se for the second purpose. This implementation can not handle the following statements:
    srand(time(0))   use the time(0) function return value as argument for function srand()
    a = c(x, y) + d(x, y, z)   handle two functions with different number of arguments.

func call()

381    func call(): void
382    var i: int
383    begin
384    // C pushes args from right to left
385    i = 0
386    while i < se do
387        printf("\tpushl\t%%e%cx\n", reg[i])
388        i = i + 1
389    wend
390    printf("\tcall\t%s\n", idcall)
391    if args > 0 then
392        printf("\taddl\t$%d, %%esp\n", 4 * args)
393    endif
394    end

The C library functions expect the function arguments on the stack in the order that the left-most argument is the top-of-stack argument. The arguments have to be pushed right-to-left on the stack. This requests normally a two-pass parsing of the function arguments. First pass to get the number of arguments, second pass to produce the assembler code with the proper stack locations of the arguments.
The BASICO compiler uses the stack-cache to reverse the arguments. After parsing the arguments left-to-right the rightmost argument is top-of-stack. The while loop in lines 385 to 389 pushes the top-of-stack-cache first (rightmost argument) and the fourth-of-stack-cache last (leftmost argument). This trick avoids two pass parsing of function arguments, but limits the number of function arguments to the number of CPU registers in the stack-cache, which is 4.

func exprList()

exprList =
'(' [ expr { ',' expr } ] ')'
.

Valid examples for exprList EBNF are:

'(' ')'
'(' a ')'
'(' a, b ')'

829    func exprList(): void
830    begin
831    Expect('(')
832    if in(lookahead, "xnsc(-~") then
833        expr(); callarg()
834        while lookahead = ',' do
835          Get(); expr(); callarg()
836        wend
837    endif
838    Expect(')')
839    end

The first set for exprList is IDENT, CONST, STRCONST, CHRCONST, '(', '-' and '~'.

func callarg()

373    func callarg(): void
374    begin
375    args = args + 1
376    if args > 4 then
377        error("too many args")
378    endif
379    end

Function callarg() increments the number of function arguments in line 375. Because we have only 4 CPU registers for the stack-cache, more then 4 function arguments are not allowed. See functions call() and callid() for details.

func expr()

expr =
addexpr { '=' addexpr
          | '#' addexpr
          | '<' [ '=' ] addexpr
          | '>' [ '=' ] addexpr
} .

The BASICO expr() is called condition or conditional_expression in other compilers. A difference to standard compilers is the handling of "<=" and ">=" tokens. In most compilers the scanner detects these 2-character tokens. Like in Crenshaw's TINY the BASICO scanner detects two tokens: a '>' and a '=' token for ">=". The BASICO parser has to understand this.
The TINY approach simplifies the scanner. The disadvantage is not so obvious: it is possible to write white space between '>' and '='.

803    func expr(): void
804    begin
805    addexpr()
806    while in(lookahead, "=#<>") do
807        if lookahead = '=' then
808          Get(); addexpr(); rel("equal", "sete")
809        else if lookahead = '#' then
810          Get(); addexpr(); rel("not equal", "setne")
811        else if lookahead = '<' then
812          Get()
813          if lookahead = '=' then
814            Get(); addexpr(); rel("less or equal", "setle")
815          else
816            addexpr(); rel("less", "setl")
817          endif
818        else
819          Get()
820          if lookahead = '=' then
821            Get(); addexpr(); rel("greater or equal", "setge")
822          else
823            addexpr(); rel("greater", "setg")
824          endif
825        endif; endif; endif
826    wend
827    end

The "working opcodes" in function expr() are create with the function rel().

func rel()

535    func rel(comment: array char, opcode: array char): void
536    begin
537    printf("\tcmpl\t%%e%cx, %%e%cx", reg[0], reg[1])
538    printf("\t#%s\n", comment)
539    printf("\t%s\t%%%cl\n", opcode, reg[1])
540    printf("\tmovzbl\t%%%cl, %%e%cx\n", reg[1], reg[1])
541    pop()
542    end

Function rel() is a functional macro. The function arguments comment and opcode are build in a relational template. The assembler code is a good example about CPU evolution (yes, evolution is everywhere, even in CPU opcodes). In the old days, only short-circuit evaluation of conditions was done, after the compare opcode cmpl there was a conditional jump like jz for jump if zero. In BASICO compiler we set integer variables depending on the outcome of compare. The setz opcode sets the register to 1 if the compare was zero. Because only the lower 8 bits of the register are set by setz opcode we need the movzbl 8-bit to 32-bit extend opcode.

func addexpr()

addexpr =
term { '+' term
        | '-' term
        | '|' term
        | '^' term
} .

Now we come to the "working" opcodes. This is the level of assembler programmers: addl, subl, orl and xorl for signed addition, signed subtraction, binary inclusive or and binary exclusive or.

841    func addexpr(): void
842    begin
843    term()
844    while in(lookahead, "+-|^") do
845        if lookahead = '+' then
846          Get(); term()
847          printf("\taddl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
848          pop()
849        else if lookahead = '-' then
850          Get(); term()
851          printf("\tsubl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
852          pop()
853        else if lookahead = '|' then
854          Get(); term()
855          printf("\torl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
856          pop()
857        else
858          Get(); term()
859          printf("\txorl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
860          pop()
861        endif ; endif ; endif
862    wend
863    end

func term()

term =
unaryfact { '*' unaryfact
            | '/' unaryfact
            | '%' unaryfact
            | '&' unaryfact
} .

More working opcodes. The opcodes imul for signed multiplication and andl for binary and are simple. The division and remainder opcodes are created in the divmod() function. This is due to the "non-orthogonality" of the 80386 opcodes. Orthogonal for a CPU is, that registers are exchangeable. "All general registers are equal". In most real (CISC) CPUs some general registers are more equal. See divmod() for details.

865    func term(): void
866    begin
867    unaryfact()
868    while in(lookahead, "*/%&") do
869        if lookahead = '*' then
870          Get(); unaryfact()
871          printf("\timul\t%%e%cx, %%e%cx\n", reg[0], reg[1])
872          pop()
873        else if lookahead = '/' then
874          Get(); unaryfact()
875          divmod()
876          printf("\tmovl\t%%esi, %%edx\n");     // restore edx
877          pop()
878        else if lookahead = '%' then
879          Get(); unaryfact()
880          divmod()
881          printf("\tmovl\t%%edx, %%eax\n");     // use remainder
882          printf("\tmovl\t%%esi, %%edx\n");     // restore edx
883          pop()
884        else
885          Get(); unaryfact()
886          printf("\tandl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
887          pop()
888        endif ; endif ; endif
889    wend
890    end

func divmod()

513    func divmod(): void
514    var tmp: int
515    i: int
516    begin
517    // idiv op-code uses edx and eax for 64bit dividend at input
518    // output is quotient in eax, remainder in edx
519    i = 0
520    while i < 4 do
521        if (i # 1) & ('a' = reg[i]) then
522          printf("\txchgl\t%%e%cx, %%e%cx\n", reg[i], reg[1])
523          tmp = reg[i]
524          reg[i] = reg[1]
525          reg[1] = tmp
526          break
527        endif
528        i = i + 1
529    wend
530    printf("\tmovl\t%%edx, %%esi\n");     // save edx
531    printf("\tcltd\n")
532    printf("\tidivl\t%%e%cx\n", reg[0])
533    end

The 80386 CPU has some funny ideas about register use for division. It expects the upper 32 bits of the dividend in register %edx. The lower 32bits bits of the dividend have to be in register %eax. The 32bit divisor can by in any general register, how nice! The quotient is again in register %eax, the remainder is in %edx. Because the BASICO compiler is very simple, the registers get loaded this way: The if command in lines 521 to 527 exchanges the register %eax with another register to bring the dividend into register %eax. Then %eax gets sign extended to 64 bits into %edx register with the cltd opcode in line 531. Before sign extend the register contents of %edx is saved in register %esi.

func unaryfact()

unaryfact =
'-' fact
| '~' fact
| fact
.

The unary operator are '-' for minus or 2-ers complement and '~' for bit reversal or 1-ers complement.

892    func unaryfact(): void
893    begin
894    if lookahead = '-' then
895        Get(); fact()
896        printf("\tnegl\t%%e%cx\n", reg[0])
897    else if lookahead = '~' then
898        Get(); fact()
899        printf("\tnotl\t%%e%cx\n", reg[0])
900    else if in(lookahead, "xnsc(-~") then
901        fact()
902    else; error("bad unaryfact"); endif; endif; endif
903    end

func fact()

fact =
IDENT [ '[' expr ']' | exprList ]
| CONST
| CHRCONST
| STRCONST
| '(' expr ')'
.

Function fact() is the third longest function in the compiler. Again the different cases work quite independent.

905    func fact(): void
906    var i: int
907    begin
908    if lookahead = 'x' then

927    else if lookahead = 'n' then

931    else if lookahead = 'c' then

940    else if lookahead = 's' then

949    else if lookahead = '(' then

951    else error("bad fact"); endif; endif; endif; endif; endif
952    end

fact IDENT

909        Get()
910        if (lookahead = '(') | (lookahead = '[') then
911          if (lookahead = '[') then
912            strcpy(idarray, token); Get(); expr(); Expect(']'); pusharrayvar()
913          else
914            callid(); exprList()
915            call()
916            se = oldse + 1
917            strcpy(reg, "abcd");          // return value in eax
918            i = 1
919            while i < se do
920              printf("\tpopl\t%%e%cx\n", reg[i])
921              i = i + 1
922            wend
923          endif
924        else
925          pushvar()
926        endif

fact CONST

928        Get()
929        push()
930        printf("\tmovl\t$%d, %%e%cx\n", number, reg[0])

Loading an integer literal is a simple load constant to register opcode.

fact CHRCONST

932        Get()
933        push()
934        printf("\tmovl\t$%d, %%e%cx", token[0], reg[0])
935        if token[0] >= ' ' then
936          printf("\t# '%c'\n", token[0])
937        else
938          printf("\n")
939        endif

Loading a character constant is a simple mov opcode in line 934. The if command in lines 935 to 939 write a little comment to the assmbler file to show the ASCII character if it is a printable ASCII character.

fact STRCONST

941        Get()
942        printf("\t.section .rodata\n")
943        printf(".LC%d:\n", lco)
944        printf("\t.string\t%s\n", token)
945        printf("\t.text\n")
946        push()
947        printf("\tmovl\t$.LC%d, %%e%cx\n", lco, reg[0])
948        lco = lco + 1

Loading a string constant is lengthy. The assembler needs a section pseudo-opcode (line 942) and a string constant label (line 943). The real opcode is again a simple move.

fact (expr)

950 Get(); expr(); Expect(')')

This is the recursive call to expr(). Without this statement BASICO and any other high level language would be less powerful. Even FORTRAN, a programming language that does not offer recursion needs recursion, or a recursion emulation, for expression evaluation.

func push()

396    func push(): void
397    var i: int
398    tmp: char
399    begin
400    i = 4 - 1
401    if se >= 4 then
402        printf("\tpushl\t%%e%cx\n", reg[i])
403    endif
404    tmp = reg[i]
405    while i > 0 do
406        reg[i] = reg[i - 1]
407        i = i - 1
408    wend
409    reg[0] = tmp
410    se = se + 1
411    end

The function push() is one part of the stack-cache implementation, the other part is function pop(). The real CPU stack is used with the pushl opcode in line 402 if more then 4 stack elements (variable se) are needed. The fourth-of-stack element is pushed and becomes the new top-of stack element.

func pushvar()

468    func pushvar(): void
469    var type: int
470    offset: array 1 int                // array for call-by-reference
471    begin
472    type = gettype(offset, token)
473    push()
474    if 'i' = type then
475        printf("\tmovl\t%s, %%e%cx\n", token, reg[0])
476    else if ('I' = type ) | ( 'C' = type) then
477        printf("\tmovl\t$%s, %%e%cx\n", token, reg[0])
478    else if 'j' = type then
479        printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
480    else if ('J' = type ) | ( 'D' = type) then
481        printf("\tleal\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
482    else if 'c' = type then
483        printf("\tmovzbl\t%s, %%e%cx\n", token, reg[0])
484    else if 'd' = type then
485        printf("\tmovzbl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
486    else if ('K' = type ) | ( 'E' = type) then
487        printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
488    endif; endif; endif; endif; endif; endif; endif
489    end

Function pushvar() is the opposite of assign(). It gets the values from memory into the stack-cache, the general purpose registers of the CPU.

func pusharrayvar()

491    func pusharrayvar(): void
492    var type: int
493    offset: array 1 int                // array for call-by-reference
494    begin
495    type = gettype(offset, idarray)
496    if 'I' = type then
497        printf("\tmovl\t%s(,%%e%cx,4), %%e%cx\n", idarray, reg[0], reg[0])
498    else if 'J' = type then
499        printf("\tmovl\t%d(%%ebp,%%e%cx,4), %%e%cx\n", offset[0], reg[0], reg[0])
500    else if 'K' = type then
501        printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
502        printf("\tmovl\t(%%esi,%%e%cx,4), %%e%cx\n", reg[0], reg[0])
503    else if 'C' = type then
504        printf("\tmovzbl\t%s(%%e%cx), %%e%cx\n", idarray, reg[0], reg[0])
505    else if 'D' = type then
506        printf("\tmovzbl\t%d(%%e%cx,%%ebp), %%e%cx\n", offset[0], reg[0], reg[0])
507    else if 'E' = type then
508        printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
509        printf("\tmovzbl\t(%%esi,%%e%cx), %%e%cx\n", reg[0], reg[0])
510    endif; endif; endif; endif; endif; endif
511    end

Function pusharrayvar() is the opposite of assign(). It gets the values from memory into the stack-cache, the general purpose registers of the CPU.

Conclusions

Writing the BASICO compiler was fun. Writing the documentation was not so much fun. Still I hope that this source code documentation shows that source code reading is possible. Some computer languages (Lisp, Forth) allow to create your own control structures, which makes them "write-only" if the programmers get lost in compiler extension dreams. Others have such a big, big, big syntax, semantics and library (C, C++, C#, JAVA) that a new programmer needs forever to learn all ways how to do things and stays a beginner reader for a long time.
BASICO is a very simple language. Reading and writing BASICO source code should be easy. And even if you never intend to write a program in BASICO more complex then "Hello BASICO", the BASICO compiler shows you how your source code is translated into assembler code.
1000 source code lines are read. That's all. Goodbye!

Previous chapter BASICO source code reading - Parser 1
Back to BASICO title page