BASICO compiler source code reading - Parser 2

Author: Andre Adrian
Version: 09.jul.2006


The functions of this section produce most of the assembler code. The condition, arithmetric and logic operations in expr(), addexpr(), term(), unaryfact() and fact() produce "the working" opcodes like add and mul and a lot of transport opcodes (mov).

func gettype()

 294    func gettype(offset: array int, name: array char): int
 295    var id: int
 296    begin
 297      id = lookup(locals, name)
 298      if id >= 0 then
 299        offset[0] = localoffset[id]
 300        return localtype[id]
 301      endif
 302      id = lookup(globals, name)
 303      if -1 = id then
 304        error("variable not found")
 305      endif
 306      return globaltype[id]
 307    end

Function gettype() has two output values. The variable type is returned through the return value. The variable offset is returned through the call-by-reference array variable offset. The first lookup() for local variables is in line 297. The second lookup() for global variables is in line 302. This implements the expected local variables are used instead of global variables in case of identical name.
The variable type is coded into a single-character-representation. The coding is:

global integer global integer array local integer local integer array call-by-reference integer array
i
I
j
J
K

global char global char array local char local char array call-by-reference char array
c
C
d
D
E


func assign()

 434    func assign(): void
 435    begin
 436      if 'i' = typevar then
 437        printf("\tmovl\t%%e%cx, %s\n", reg[0], idvar)
 438      else if 'j' = typevar then
 439        printf("\tmovl\t%%e%cx, %d(%%ebp)\n", reg[0], offsetvar[0])
 440      else if 'I' = typevar then
 441        printf("\tmovl\t%%e%cx, %s(,%%e%cx,4)\n", reg[0], idvar, reg[1])
 442        pop()
 443      else if 'J' = typevar then
 444        printf("\tmovl\t%%e%cx, %d(%%ebp,%%e%cx,4)\n", reg[0], offsetvar[0], reg[1])
 445        pop()
 446      else if 'K' = typevar then
 447        printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
 448        printf("\tmovl\t%%e%cx, (%%esi,%%e%cx,4)\n", reg[0], reg[1])
 449        pop()
 450      else if 'c' = typevar then
 451        printf("\tmovb\t%%%cl, %s\n", reg[0], idvar)
 452      else if 'd' = typevar then
 453        printf("\tmovb\t%%%cl, %d(%%ebp)\n", reg[0], offsetvar[0])
 454      else if 'C' = typevar then
 455        printf("\tmovb\t%%%cl, %s(%%e%cx)\n", reg[0], idvar, reg[1])
 456        pop()
 457      else if 'D' = typevar then
 458        printf("\tmovb\t%%%cl, %d(%%e%cx,%%ebp)\n", reg[0], offsetvar[0], reg[1])
 459        pop()
 460      else if 'E' = typevar then
 461        printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
 462        printf("\tmovb\t%%%cl, (%%esi,%%e%cx)\n", reg[0], reg[1])
 463        pop()
 464      endif; endif; endif; endif; endif; endif; endif; endif; endif; endif
 465      pop()
 466    end

Function assign() selects the correct transport opcode depending on variable type. The 80386 transport opcodes are quite powerful. The base_of_array + size_of_array_element * array_index calculation is done as part of the store mov opcode. Only in case of call-by-reference assignment, we need two opcodes and the CPU register %esi to done the indirect store.

func pop()

 413    func pop(): void
 414    var i: int
 415      tmp: char
 416    begin
 417      se = se - 1
 418      if se = 0 then
 419        cacheinit()
 420      else
 421        i = 0
 422        tmp = reg[i]
 423        while i < 4 - 1 do
 424          reg[i] = reg[i + 1]
 425          i = i + 1
 426        wend
 427        reg[i] = tmp
 428      endif
 429      if se >= 4 then
 430        printf("\tpopl\t%%e%cx\n", reg[i])
 431      endif
 432    end

The function pop() is one part of the stack-cache implementation, the other part is function push(). The real CPU stack is used with the popl opcode in line 430 if more then 4 stack elements (variable se) are needed. If pop() is called, the old top-of-stack element is dropped and the fourth-of-stack element is loaded from real stack to stack cache.

func callid()

 359    func callid(): void
 360    var i: int
 361    begin
 362      i = se
 363      while i >= 1 do
 364        i = i - 1
 365        printf("\tpushl\t%%e%cx\n", reg[i])
 366      wend
 367      oldse = se
 368      se = 0
 369      strcpy(idcall, token)
 370      args = 0
 371    end

In the BASICO statement a = b + c(x, y) the stack cache is used for two purposes: First to code the assignment and the expression (the statement fragment a = b + ). Second to handle the function arguments of c(x, y). Function callid() pushes the stack cache to the "real" stack at the moment the parser sees c(, a function. The stack elements se is stored in variable oldse, se is set to 0 to init se for the second purpose. This implementation can not handle the following statements:
    srand(time(0))   use the time(0) function return value as argument for function srand()
    a = c(x, y) + d(x, y, z)   handle two functions with different number of arguments.

func call()

 381    func call(): void
 382    var i: int
 383    begin
 384      // C pushes args from right to left
 385      i = 0
 386      while i < se do
 387        printf("\tpushl\t%%e%cx\n", reg[i])
 388        i = i + 1
 389      wend
 390      printf("\tcall\t%s\n", idcall)
 391      if args > 0 then
 392        printf("\taddl\t$%d, %%esp\n", 4 * args)
 393      endif
 394    end

The C library functions expect the function arguments on the stack in the order that the left-most argument is the top-of-stack argument. The arguments have to be pushed right-to-left on the stack. This requests normally a two-pass parsing of the function arguments. First pass to get the number of arguments, second pass to produce the assembler code with the proper stack locations of the arguments.
The BASICO compiler uses the stack-cache to reverse the arguments. After parsing the arguments left-to-right the rightmost argument is top-of-stack. The while loop in lines 385 to 389 pushes the top-of-stack-cache first (rightmost argument) and the fourth-of-stack-cache last (leftmost argument). This trick avoids two pass parsing of function arguments, but limits the number of function arguments to the number of CPU registers in the stack-cache, which is 4.

func exprList()

exprList =
  '(' [ expr { ',' expr } ] ')'
  .
 
Valid examples for exprList EBNF are:
'(' ')'
'(' a ')'
'(' a, b ')'

 829    func exprList(): void
 830    begin
 831      Expect('(')
 832      if in(lookahead, "xnsc(-~") then
 833        expr(); callarg()
 834        while lookahead = ',' do
 835          Get(); expr(); callarg()
 836        wend
 837      endif
 838      Expect(')')
 839    end

The first set for exprList is IDENT, CONST, STRCONST, CHRCONST, '(', '-' and '~'.

func callarg()

 373    func callarg(): void
 374    begin
 375      args = args + 1
 376      if args > 4 then
 377        error("too many args")
 378      endif
 379    end

Function callarg() increments the number of function arguments in line 375. Because we have only 4 CPU registers for the stack-cache, more then 4 function arguments are not allowed. See functions call() and callid() for details.

func expr()

expr =
  addexpr { '=' addexpr
          | '#' addexpr
          | '<' [ '=' ] addexpr
          | '>' [ '=' ] addexpr
  } .

The BASICO expr() is called condition or conditional_expression in other compilers. A difference to standard compilers is the handling of "<=" and ">=" tokens. In most compilers the scanner detects these 2-character tokens. Like in Crenshaw's TINY the BASICO scanner detects two tokens: a '>' and a '=' token for ">=". The BASICO parser has to understand this.
The TINY approach simplifies the scanner. The disadvantage is not so obvious: it is possible to write white space between '>' and '='.

 803    func expr(): void
 804    begin
 805      addexpr()
 806      while in(lookahead, "=#<>") do
 807        if lookahead = '=' then
 808          Get(); addexpr(); rel("equal", "sete")
 809        else if lookahead = '#' then
 810          Get(); addexpr(); rel("not equal", "setne")
 811        else if lookahead = '<' then
 812          Get()
 813          if lookahead = '=' then
 814            Get(); addexpr(); rel("less or equal", "setle")
 815          else
 816            addexpr(); rel("less", "setl")
 817          endif 
 818        else
 819          Get()
 820          if lookahead = '=' then
 821            Get(); addexpr(); rel("greater or equal", "setge")
 822          else
 823            addexpr(); rel("greater", "setg")
 824          endif       
 825        endif; endif; endif
 826      wend
 827    end

The "working opcodes" in function expr() are create with the function rel().

func rel()

 535    func rel(comment: array char, opcode: array char): void
 536    begin
 537      printf("\tcmpl\t%%e%cx, %%e%cx", reg[0], reg[1])
 538      printf("\t#%s\n", comment)
 539      printf("\t%s\t%%%cl\n", opcode, reg[1])
 540      printf("\tmovzbl\t%%%cl, %%e%cx\n", reg[1], reg[1])
 541      pop()
 542    end

Function rel() is a functional macro. The function arguments comment and opcode are build in a relational template. The assembler code is a good example about CPU evolution (yes, evolution is everywhere, even in CPU opcodes). In the old days, only short-circuit evaluation of conditions was done, after the compare opcode cmpl there was a conditional jump like jz for jump if zero. In BASICO compiler we set integer variables depending on the outcome of compare. The setz opcode sets the register to 1 if the compare was zero. Because only the lower 8 bits of the register are set by setz opcode we need the movzbl 8-bit to 32-bit extend opcode.

func addexpr()

addexpr =
  term {  '+' term
        | '-' term
        | '|' term
        | '^' term
  } .
 
Now we come to the "working" opcodes. This is the level of assembler programmers: addl, subl, orl and xorl for signed addition, signed subtraction, binary inclusive or and binary exclusive or.

 841    func addexpr(): void
 842    begin
 843      term()
 844      while in(lookahead, "+-|^") do
 845        if lookahead = '+' then
 846          Get(); term()
 847          printf("\taddl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 848          pop()
 849        else if lookahead = '-' then
 850          Get(); term()
 851          printf("\tsubl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 852          pop()
 853        else if lookahead = '|' then
 854          Get(); term()
 855          printf("\torl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 856          pop()
 857        else
 858          Get(); term()
 859          printf("\txorl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 860          pop()
 861        endif ; endif ; endif
 862      wend     
 863    end



func term()

term =
  unaryfact { '*' unaryfact
            | '/' unaryfact
            | '%' unaryfact
            | '&' unaryfact
  } .

More working opcodes. The opcodes imul for signed multiplication and andl for binary and are simple. The division and remainder opcodes are created in the divmod() function. This is due to the "non-orthogonality" of the 80386 opcodes. Orthogonal for a CPU is, that registers are exchangeable. "All general registers are equal". In most real (CISC) CPUs some general registers are more equal. See divmod() for details.

 865    func term(): void
 866    begin
 867      unaryfact()
 868      while in(lookahead, "*/%&") do
 869        if lookahead = '*' then
 870          Get(); unaryfact()
 871          printf("\timul\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 872          pop()
 873        else if lookahead = '/' then
 874          Get(); unaryfact()
 875          divmod()
 876          printf("\tmovl\t%%esi, %%edx\n");     // restore edx
 877          pop()
 878        else if lookahead = '%' then
 879          Get(); unaryfact()
 880          divmod()
 881          printf("\tmovl\t%%edx, %%eax\n");     // use remainder
 882          printf("\tmovl\t%%esi, %%edx\n");     // restore edx
 883          pop()
 884        else
 885          Get(); unaryfact()
 886          printf("\tandl\t%%e%cx, %%e%cx\n", reg[0], reg[1])
 887          pop()
 888        endif ; endif ; endif
 889      wend     
 890    end


func divmod()

 513    func divmod(): void
 514    var tmp: int
 515      i: int
 516    begin
 517      // idiv op-code uses edx and eax for 64bit dividend at input
 518      // output is quotient in eax, remainder in edx
 519      i = 0
 520      while i < 4 do
 521        if (i # 1) & ('a' = reg[i]) then
 522          printf("\txchgl\t%%e%cx, %%e%cx\n", reg[i], reg[1])
 523          tmp = reg[i]
 524          reg[i] = reg[1]
 525          reg[1] = tmp
 526          break
 527        endif
 528        i = i + 1
 529      wend
 530      printf("\tmovl\t%%edx, %%esi\n");     // save edx
 531      printf("\tcltd\n")
 532      printf("\tidivl\t%%e%cx\n", reg[0])
 533    end

The 80386 CPU has some funny ideas about register use for division. It expects the upper 32 bits of the dividend in register %edx. The lower 32bits bits of the dividend have to be in register %eax. The 32bit divisor can by in any general register, how nice! The quotient is again in register %eax, the remainder is in %edx. Because the BASICO compiler is very simple, the registers get loaded this way: The if command in lines 521 to 527 exchanges the register %eax with another register to bring the dividend into register %eax. Then %eax gets sign extended to 64 bits into %edx register with the cltd opcode in line 531. Before sign extend the register contents of %edx is saved in register %esi.

func unaryfact()

unaryfact =
  '-' fact          
  | '~' fact        
  | fact
  .

The unary operator are '-' for minus or 2-ers complement and '~' for bit reversal or 1-ers complement.

 892    func unaryfact(): void
 893    begin
 894      if lookahead = '-' then
 895        Get(); fact()
 896        printf("\tnegl\t%%e%cx\n", reg[0])
 897      else if lookahead = '~' then
 898        Get(); fact()
 899        printf("\tnotl\t%%e%cx\n", reg[0])
 900      else if in(lookahead, "xnsc(-~") then
 901        fact()
 902      else; error("bad unaryfact"); endif; endif; endif
 903    end


func fact()

fact =
  IDENT [ '[' expr ']' | exprList ]
  | CONST
  | CHRCONST             
  | STRCONST                       
  | '(' expr ')'
  .

Function fact() is the third longest function in the compiler. Again the different cases work quite independent.

 905    func fact(): void
 906    var i: int
 907    begin
 908      if lookahead = 'x' then

 927      else if lookahead = 'n' then

 931      else if lookahead = 'c' then

 940      else if lookahead = 's' then

 949      else if lookahead = '(' then

 951      else error("bad fact"); endif; endif; endif; endif; endif
 952    end


fact IDENT

 909        Get()
 910        if (lookahead = '(') | (lookahead = '[') then
 911          if (lookahead = '[') then
 912            strcpy(idarray, token); Get(); expr(); Expect(']'); pusharrayvar()
 913          else
 914            callid(); exprList()
 915            call()
 916            se = oldse + 1
 917            strcpy(reg, "abcd");          // return value in eax
 918            i = 1
 919            while i < se do
 920              printf("\tpopl\t%%e%cx\n", reg[i])
 921              i = i + 1
 922            wend
 923          endif
 924        else
 925          pushvar()
 926        endif


fact CONST

 928        Get()
 929        push()
 930        printf("\tmovl\t$%d, %%e%cx\n", number, reg[0])

Loading an integer literal is a simple load constant to register opcode.

fact CHRCONST

 932        Get()
 933        push()
 934        printf("\tmovl\t$%d, %%e%cx", token[0], reg[0])
 935        if token[0] >= ' ' then
 936          printf("\t# '%c'\n", token[0])
 937        else
 938          printf("\n")
 939        endif

Loading a character constant is a simple mov opcode in line 934. The if command in lines 935 to 939 write a little comment to the assmbler file to show the ASCII character if it is a printable ASCII character.

fact STRCONST

 941        Get()
 942        printf("\t.section .rodata\n")
 943        printf(".LC%d:\n", lco)
 944        printf("\t.string\t%s\n", token)
 945        printf("\t.text\n")
 946        push()
 947        printf("\tmovl\t$.LC%d, %%e%cx\n", lco, reg[0])
 948        lco = lco + 1

Loading a string constant is lengthy. The assembler needs a section pseudo-opcode (line 942) and a string constant label (line 943). The real opcode is again a simple move.

fact (expr)

  950        Get(); expr(); Expect(')')

This is the recursive call to expr(). Without this statement BASICO and any other high level language would be less powerful. Even FORTRAN, a programming language that does not offer recursion needs recursion, or a recursion emulation, for expression evaluation.

func push()

 396    func push(): void
 397    var i: int
 398      tmp: char
 399    begin
 400      i = 4 - 1
 401      if se >= 4 then
 402        printf("\tpushl\t%%e%cx\n", reg[i])
 403      endif
 404      tmp = reg[i]
 405      while i > 0 do
 406        reg[i] = reg[i - 1]
 407        i = i - 1
 408      wend
 409      reg[0] = tmp
 410      se = se + 1
 411    end

The function push() is one part of the stack-cache implementation, the other part is function pop(). The real CPU stack is used with the pushl opcode in line 402 if more then 4 stack elements (variable se) are needed. The fourth-of-stack element is pushed and becomes the new top-of stack element.

func pushvar()

 468    func pushvar(): void
 469    var type: int
 470      offset: array 1 int                // array for call-by-reference
 471    begin
 472      type = gettype(offset, token)
 473      push()
 474      if 'i' = type then
 475        printf("\tmovl\t%s, %%e%cx\n", token, reg[0])
 476      else if ('I' = type ) | ( 'C' = type) then
 477        printf("\tmovl\t$%s, %%e%cx\n", token, reg[0])
 478      else if 'j' = type then
 479        printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
 480      else if ('J' = type ) | ( 'D' = type) then
 481        printf("\tleal\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
 482      else if 'c' = type then
 483        printf("\tmovzbl\t%s, %%e%cx\n", token, reg[0])
 484      else if 'd' = type then
 485        printf("\tmovzbl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
 486      else if ('K' = type ) | ( 'E' = type) then
 487        printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0], reg[0])
 488      endif; endif; endif; endif; endif; endif; endif
 489    end

Function pushvar() is the opposite of assign(). It gets the values from memory into the stack-cache, the general purpose registers of the CPU.

func pusharrayvar()

 491    func pusharrayvar(): void
 492    var type: int
 493      offset: array 1 int                // array for call-by-reference
 494    begin
 495      type = gettype(offset, idarray)
 496      if 'I' = type then
 497        printf("\tmovl\t%s(,%%e%cx,4), %%e%cx\n", idarray, reg[0], reg[0])
 498      else if 'J' = type then
 499        printf("\tmovl\t%d(%%ebp,%%e%cx,4), %%e%cx\n", offset[0], reg[0], reg[0])
 500      else if 'K' = type then
 501        printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
 502        printf("\tmovl\t(%%esi,%%e%cx,4), %%e%cx\n", reg[0], reg[0])
 503      else if 'C' = type then
 504        printf("\tmovzbl\t%s(%%e%cx), %%e%cx\n", idarray, reg[0], reg[0])
 505      else if 'D' = type then
 506        printf("\tmovzbl\t%d(%%e%cx,%%ebp), %%e%cx\n", offset[0], reg[0], reg[0])
 507      else if 'E' = type then
 508        printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
 509        printf("\tmovzbl\t(%%esi,%%e%cx), %%e%cx\n", reg[0], reg[0])
 510      endif; endif; endif; endif; endif; endif
 511    end

Function pusharrayvar() is the opposite of assign(). It gets the values from memory into the stack-cache, the general purpose registers of the CPU.


Conclusions

Writing the BASICO compiler was fun. Writing the documentation was not so much fun. Still I hope that this source code documentation shows that source code reading is possible. Some computer languages (Lisp, Forth) allow to create your own control structures, which makes them "write-only" if the programmers get lost in compiler extension dreams. Others have such a big, big, big syntax, semantics and library (C, C++, C#, JAVA) that a new programmer needs forever to learn all ways how to do things and stays a beginner reader for a long time.
BASICO is a very simple language. Reading and writing BASICO source code should be easy. And even if you never intend to write a program in BASICO more complex then "Hello BASICO", the BASICO compiler shows you how your source code is translated into assembler code.
1000 source code lines are read. That's all. Goodbye!


Previous chapter BASICO source code reading - Parser 1
Back to BASICO title page