BASICO compiler source code reading - Parser 1

Author: Andre Adrian
Version: 02.jul.2006

This part handles the command parser, the next part covers the expression parser. The command parser implements the variable declaration, function definition and the control flow statements. The expression parser handles the unary and binary operators like ~, -, +, * and the variable load and store.

func basico()

Before we talk about the statements in the basico() function, let's look at the Extended Backus-Naur-Form (EBNF) grammar of basico():

basico =
  [ "var" { globalDecl ';' } ]
  { "func" IDENT '(' paramList ')' ':' returnType ';' blockOrForward ';' }

The basico definition starts with an optional "var" statement. The meta-symbol [ ] is 0 to 1 repetition of the enclosed pattern. After the reserved word "var" there can be 0 or more globalDecl that are followed each by an ';'. The meta-symbol { } is 0 to N repetitions. Literal symbols like "var" and ';' are enclosed in " " or ' ' and stay for themselves. The word globalDecl is a definition, like basico is. The most simple example for a "var" statement is the empty string, just write nothing. Other examples are:

var i:int;

var i:int; k:char; m:array 10 char;

var i:int;
    m:array 10 char;

Again, in BASICO the scanner changes linefeed to ';' if necessary. Most of the ';' we don't have to type and we should not type in an actual BASICO source code file.
Second part of the basico definition is the 0 to N times repetition of a "func", a function. The most simple BASICO program is the empty string, because the "var" and the "func" part are both valid with 0 times repetition. But now a closer look at a function.
A function starts with the reserved word func. The " " around func just says: I am a literal. Second item is IDENT. See section Identifier for the definition of IDENT.  The third item in the func grammar is the character (. The ' ' around ( says again: I am a literal. The fourth item is paramList. Because it is no literal, it has to be explained somewhere else in the grammar.  We look at this later. More important for the moment is to recognize the pattern in our EBNF. The EBNF of func is a template with literals and fill-ins. Some fill-ins are simple regular expressions like IDENT, other fill-ins are EBNFs. These other EBNFs have to come down to regular expressions and regular expressions have to come down to ASCII characters. In the end everything can be expressed with ASCII characters. To translate from one language to another, the compiler has to do two jobs: first analyse the source code, then synthesize the assembler code. The analysis is controlled by the syntax, the EBNFs and REs of the programming language. The synthesis just follows syntax. We speak of "syntax driven translation". The analysis of BASICO is simple, because the grammar is context-free and predictive. Our human languages are not context-free nor predictive. This is the reason why machine translation is working great in the area of programming languages and is working poor in the area of human languages.

 560    func basico(): void
 561    begin
 562      if lookahead = 'v' then
 563        Get()
 564        while lookahead = 'x' do    // globalDecl first
 565          globalDecl()
 566          Expect(';')               // globalDecl follow
 567        wend
 568      endif
 569      while lookahead = 'F' do
 570        Get(); Expect('x')
 571        strcpy(idfunc, token)
 572        frame = 0
 573        param = 8
 574        Expect('('); paramList(); Expect(')')
 575        Expect(':'); returnType(); Expect(';')
 576        blockOrForward(); Expect(';')
 577      wend
 578    end

The EBNF option meta-symbols [ ] are translated into an if command in the source code of a parser. The repetition meta-symbols { } are translated to a while command. The EBNF [ "var" { globalDecl ';' } ] is coded in lines 562 to 568. A EBNF definition like globalDecl is translated into a function call. The functions Get() and Expect() are explained next.
Line 562 is easy. The one-character-representation of "var" is 'v', and the line is just the necessary if to realize an EBNF option. Line 564 has the while statement that we need to implement a repetition. The compare in line 564 is lookahead = 'x', which we can convert to lookahead = IDENT. In the EBNF of "var" there is on IDENT at this location, we read globalDecl.
One of the tricky parts of translation a EBNF into the source code of a parser is to find the first sets. What is a first set? Well, the source code of our parser can not compare the lookahead to some abstract entity like globalDecl. The parser can only compare to some token. The first set are just all tokens that can start a globalDecl. A compiler compiler like CoCo/R has an algorithm to find the first sets in the grammar. If you like to do it by hand you have to read down the EBNFs to find the real tokens. The EBNF [ "var" { globalDecl ';' } ] uses the globalDecl EBNF. The globalDecl definition is:

globalDecl =
  IDENT ':' (
    | "char"
    | "array" CONST (
      | "char"
  ) . 

The only start token for globalDecl is IDENT. Therefore the first set of globalDecl is < IDENT >. The < > are set brackets.
Line 569 to 577 codes the EBNF { "func" IDENT '(' paramList ')' ':' returnType ';' blockOrForward ';' }. Line 569 is the while statement to realize the { } repetion. The Expect('x') in line 570 reads the IDENT. Lines 571 to 573 are part of the synthesis, the production of assembler code.
Line 571 stores the IDENT attribute (variable name) into idfunc. Line 572 sets variable frame to 0. Variable frame contains the memory size for local variables. Every local variable that is found while parsing the function increases frame.

  44      frame:      int             // locals

Line 573 inits variable param to 8. Variable param contains the offset for the function arguments. The first argument has an offset of 8. This value is CPU specific. On the 80386 CPU a function call of the function x(a, b, c) has the following pseudo code sequence:

push argument c
push argument b
push argument a
push return address, jump to x
push frame pointer

Every push needs 4 bytes (one 32bit word) on the stack. Because the stack grows down, that is every push decrements the stack pointer, the arguments have a positive offset. The argument a is found at stackpointer+8 location, the argument b is found at stackpointer+12 location.
Lines 574 to 576 codes the EBNF part '(' paramList ')' ':' returnType ';' blockOrForward ';' which does not contain new features. The EBNF definitions paramList, returnType and blockOrForward are realized as function calls. The literals '(',  ')', ':' and ';' are read with Expect().

func Get()

 546    func Get(): void
 547    var s: array 16 char
 548    begin
 549      lookahead = yylex()
 550    end

Function Get() gets one token from yylex() and stores this token in variable lookahead. The local variable s is not needed. This is one of the typical left-overs from a copy/paste/modify session. We use Get() if it is not necessary to check the token we eat again.

func Expect()

 552    func Expect(t: int): void
 553    var s: array 16 char
 554    begin
 555      if lookahead = t then
 556        Get()
 557      else sprintf(s, "%c expected", t); error(s); endif
 558    end

Function Expect() eats a token after a check. If the check fails, the compiler is terminated with a "XXX expected" message. The expected token is given as argument. The actual token out of the source code file is in variable lookahead. The call to Get() eats the token and loads a new token into variable lookahead.

func globalDecl()

The EBNF of globalDecl has two new meta-symbols. The ( ) brackets enclose a group. A group is an un-named EBNF definition. The | meta-synbols is alternative.

globalDecl =
  IDENT ':' (
    | "char"
    | "array" CONST (
      | "char"
  ) . 

The EBNF without ( ) meta-symbols is:

globalDecl = IDENT ':' globalDecl1 .
globalDecl1 = "int" | "char" | "array" CONST globalDecl2 .
globalDecl2 = "int" | "char" .

Both EBNFs are equal. The second version is easier to read. Valid examples for globalDecl are:

a: int
b: char
c: array 10 int
d: array 20 char

As synthesis action the variable declaration has to reserve memory in global memory space in the assembler file with the name of the IDENT. Another synthesis action is to remember the type of the variable. An integer variable needs other assembler op-codes for load and store then a character or array variable, so we need the type information later on in variable access, variable assignment and use of variable as parameter in a function call.

 580    func globalDecl(): void
 581    begin
 582      Expect('x'); strcpy(iddecl, token)
 583      Expect(':')                // syntactic sugar
 584      if lookahead = 'I' then
 585        Get()
 586        printf("\t.comm  %s,4,4\n", iddecl)
 587        addglobal(iddecl, 'i')
 588      else if lookahead = 'C' then
 589        Get()
 590        printf("\t.comm  %s,1,1\n", iddecl)
 591        addglobal(iddecl, 'c')
 592      else if lookahead = 'a' then
 593        Get(); Expect('n')
 594        if lookahead = 'I' then
 595          Get()
 596          printf("\t.comm  %s,%d,4\n", iddecl, number * 4)
 597          addglobal(iddecl, 'I')
 598        else if lookahead = 'C' then
 599          Get()
 600          printf("\t.comm  %s,%d,1\n", iddecl, number)
 601          addglobal(iddecl, 'C')
 602        else; error("int, char expected"); endif; endif
 603      else; error("int, char, array expected"); endif; endif; endif
 604    end

Line 582 and 583 eat the IDENT ':' part of the globalDecl EBNF. As synthesis action the IDENT attribute token (the variable name) is stored in iddecl in line 582. The EBNF alternative is realized as an if-then-else-if-then-else...endif-endif construct. The lines 584, 588 and 592 start the alternatives "int", "char" and "array". For alternative "int" line 585 eats the "int" token. Line 586 produces the memory reservation in the assembler file with the C library function printf(). Line 587 calls addglobal() to save the variable name together with variable type for later use.
Lines 589 to 591 do the same for character what lines 585 to 587 do for integer.
Line 593 eats the "array" token and eats the interger literal for array size. After Expect('n') the integer literal is in variable number. In line 596 variable number is used to calculate the memory size for the integer array. Line 597 calls addglobal() with variable name and variable type, but without array size. A PASCAL compiler would save the array size for all the automatic buffer overwrite checks it wants to do. A C compiler would save it for the sizeof() compiler function which tells the size of the argument.
Line 602 contains a call to error() that is executed if the parser does not find the expected "int" or "char" as last tokens of an array variable declaration. Line 603 contains an error() call that is triggered if after the ':' the parser does not find "int", "char" or "array".

func addglobal()

 326    func addglobal(name: array char, type: char): int
 327    var id: int
 328    begin
 329      id = newname(globals, name)
 330      globaltype[id] = type
 331      return id
 332    end

Function addglobal() is an enriche/modify function. The real work is done in newname(). Function addglobal() selects the variable globals as symbol table for newname(), which make sense because in variable globals we store the names of all global variables. In line 330 the return value of newname is used as index for the array globaltype. The type information that is argument type to function addlocal() is stored in globaltype under the proper index. Line 331 returns with id.

func newname()

 309    func newname(symtab: array char, name: array char): int
 310    var ident: int
 311    begin
 312      ident = lookup(symtab, name)
 313      if ident >= 0 then
 314        return ident
 315      else
 316        if strlen(symtab) + strlen(name) + 2 <= 1000 then
 317          strcat(symtab, name)
 318          strcat(symtab, " ")
 319        else
 320          error("symtab full")
 321        endif
 322        return lookup(symtab, name)
 323      endif
 324    end

Function newname() adds a new variable name to the symbol table symtab. Before the add, the symbol table is checked with lookup() if this variable name does not exist already. In BASICO multiple declaration is forbidden. It is allowed to have a global variable i and a local variable i. But a global variable i that is first declared as an integer and then declared as an character does not make sense.
Unfortunely the current newname() implementation does not support this behaviour. The lookup() check is done in line 312, but instead of an error("multiple declaration") in line 314 there is a return ident. This is a real bug in version 0.8 of BASICO.
Lines 316 to 321 concatenate the name with a trailing space to the symbol table, or terminate with error("symtab full"). The C library function strlen() returns the string length without the NUL character. The string "abc" needs 4 characters for store, but strlen() will return 3.
Line 322 calls lookup() again to get the index of the just added name in the symbol table. This call to lookup() can be eliminated with two variables globalid and localid that get incremented in addglobal() and addlocal(). This is an example of the famous speed/memory trade-off: We spend time to run a function like lookup() to calculate some information or we use some variables like globalid and localid to have the information already present. Still the variables have to be updated and this is the nasty part of all the use memory to improve speed optimizations: often the variables get not updated in all cases and run out of synchronization.
The BASICO compiler did not run faster with the globalid, localid optimizations, so the optimizations were thrown out again. An optimization that does not save time is only an unnecessary complication of the source code. Always say to your source code: Earn your money or leave my file!
Note: If your payment is on a lines of code basis, this advice is bad for you. If you code for high Reliability/Maintainability/Availability (RMA) this is the best advice you ever get.

func paramList()

paramList =
  [ paramDecl { ',' paramDecl } ]

 606    func paramList(): void
 607    begin
 608      if lookahead = 'x' then
 609        paramDecl()
 610        while lookahead = ',' do
 611          Get()
 612          paramDecl()
 613        wend
 614      endif
 615    end

func paramDecl()

paramDecl =
  IDENT ':' (
    | "char"
    | "array" (
      | "char"
  ) .

 637    func paramDecl(): void
 638    begin
 639      Expect('x'); strcpy(iddecl, token)
 640      Expect(':')
 641      if lookahead = 'I' then
 642        Get()
 643        addlocal(iddecl, 'j', param)
 644        param = param + 4
 645      else if lookahead = 'C' then
 646        Get()
 647        addlocal(iddecl, 'd', param)
 648        param = param + 4
 649      else if lookahead = 'a' then
 650        Get()
 651        if lookahead = 'I' then
 652          Get()
 653          addlocal(iddecl, 'K', param)
 654          param = param + 4
 655        else if lookahead = 'C' then
 656          Get()
 657          addlocal(iddecl, 'E', param)
 658          param = param + 4
 659        else; error("int, char expected"); endif; endif
 660      else; error("int, char, array expected"); endif; endif; endif
 661    end

func returnType()

returnType =
  | "char"                        
  | "void"                        

 617    func returnType(): void
 618    begin
 619      if lookahead = 'I' then
 620        Get(); printf("\t\t\t\t#func returns int\n")
 621      else if lookahead = 'C' then
 622        Get(); printf("\t\t\t\t#func returns char\n")
 623      else if lookahead = 'V' then
 624        Get(); printf("\t\t\t\t#func returns void\n")
 625      else; error("bad returnType"); endif; endif; endif
 626    end

func blockOrForward()

blockOrForward =
  | "forward"

 628    func blockOrForward(): void
 629    begin
 630      if (lookahead = 'v') | (lookahead = 'b') then
 631        pblock3()
 632      else if lookahead = 'f' then
 633        Get(); printf("\t\t\t\t#forward decl\n")
 634      else error("bad blockOrForward"); endif; endif
 635    end

func pblock3()

pblock3 =
  [ "var" { localDecl ';' } ]
  "begin" stmtList "end"

 663    func pblock3(): void
 664    begin
 665      if lookahead = 'v' then
 666        Get();
 667        while lookahead = 'x' do  // globalDecl first
 668          localDecl()
 669          Expect(';')              // globalDecl follow
 670        wend
 671      endif
 672      Expect('b');
 673      cacheinit()
 674      printf(".globl\t%s\n", idfunc)
 675      printf("\t.type\t%s, @function\n", idfunc)
 676      printf("%s:\n", idfunc)
 677      printf("\tpushl\t%%ebp\n")
 678      printf("\tmovl\t%%esp, %%ebp\n")
 679      if frame < 0 then
 680        printf("\tsubl\t$%d, %%esp\n", -frame)
 681      endif
 682      stmtList(); Expect('e')
 683      printf("\tleave\n")
 684      printf("\tret\n")
 685      printf("\t.size %s, .-%s\n", idfunc, idfunc)
 686      locals[0] = 0   // init locals
 687    end

func localDecl()

localDecl =
  IDENT ':' (
    | "char"
    | "array" CONST (
      | "char"
  ) . 

 689    func localDecl(): void
 690    begin
 691      Expect('x'); strcpy(iddecl, token)
 692      Expect(':')                // syntactic sugar
 693      if lookahead = 'I' then
 694        Get()
 695        frame = frame - 4
 696        addlocal(iddecl, 'j', frame)
 697      else if lookahead = 'C' then
 698        Get()
 699        frame = frame - 1
 700        addlocal(iddecl, 'd', frame)
 701      else if lookahead = 'a' then
 702        Get(); Expect('n')
 703        if lookahead = 'I' then
 704          Get()
 705          frame = frame - number * 4
 706          addlocal(iddecl, 'J', frame)
 707        else if lookahead = 'C' then
 708          Get()
 709          frame = frame - number
 710          addlocal(iddecl, 'D', frame)
 711        else; error("int, char expected"); endif; endif
 712      else; error("int, char, array expected"); endif; endif; endif
 713    end

func stmtList()

stmtList =
  { [ statement ] ';' }

Some valid examples for the stmtlist EBNF:
The empty set
statement ';'
statement ';' statement ';'

 715    func stmtList(): void
 716    begin
 717      while in(lookahead, "x;iWrB") do
 718        if in(lookahead, "xiWrB") then
 719          statement()
 720        endif
 721        Expect(';')
 722      wend
 723    end

The first set for stmtList() is IDENT, ';' "if", "while", "return" and "break".

func statement()

statement =
    [ '[' expr ']' ] '=' expr
    | exprList 
  | "if" expr "then" stmtList [ "else" stmtList ] "endif"
  | "while" expr "do" stmtList "wend"
  | "return" [ expr ]            
  | "break"

Function statement() is the second longest function after yylex(). Again it can be split in several sections that work independent of each other. First we show the control flow of statement(), then we show the sections:

 725    func statement(): void
 726    var L1: int   // if jump label
 727      L2: int  
 728    begin
 729      if lookahead = 'x' then

 743      else if lookahead = 'i' then

 765      else if lookahead = 'W' then

 782      else if lookahead = 'r' then

 793      else if lookahead = 'B' then

 800      else error("bad statement"); endif; endif; endif; endif; endif
 801    end

The control flow of statement() is simple. Just check the reserved words in a if-then-else-if ... multiple selection.

statement IDENT

 730        Get()
 731        if (lookahead = '[') | (lookahead = '=') then
 732          strcpy(idvar, token)
 733          typevar = gettype(offsetvar, idvar)
 734          if lookahead = '[' then
 735            Get(); expr(); Expect(']')
 736          endif
 737          Expect('='); expr(); assign()
 738        else if lookahead = '(' then
 739          callid(); exprList()
 740          call()
 741          cacheinit()
 742        else error("bad statement1"); endif; endif

Lines 731 to 737 handle the assignment to a variable case. Lines 738 to 741 handle the procedure call case.

statement if

 744        Get(); expr(); Expect('t')
 745        level = level + 1
 746        L1 = label; label = label + 1
 747        L2 = L1
 748        printf("\tandl\t%%e%cx, %%e%cx", reg[0], reg[0])
 749        printf("\t#then_jump %d\n", level)
 750        printf("\tjz\t.L%d\n", L1)
 751        cacheinit()
 752        stmtList()
 753        if lookahead = 'l' then
 754          Get()
 755          L2 = label; label = label + 1 
 756          printf("\tjmp\t.L%d", L2)
 757          printf("\t\t#else_jump %d\n", level)
 758          printf(".L%d:\n", L1)
 759          stmtList()
 760        endif
 761        Expect('E')
 762        printf(".L%d:", L2)
 763        printf("\t\t\t\t#endif_label %d\n", level)
 764        level = level - 1

The structured if command is translated into two jumps. The jump labels (jump targets) are created in lines 746 and 747. At this moment the two jump labels are equal. A if-then-endif command uses only one jump label. The "jump if zero" opcode that is emitted in line 750 is the conditional jump around the assembler code between the then and endif block. A if-the-else-endif command uses two different jump labels. In line 755 L2 is made different to L1. Now the jump to L1 is the jump around the then-else block to the else-endif block. The last assembler code instruction of the then-else block is the unconditional jump to endif that is emitted in line 756. With L1=1 and L2=1 we get the following assembler code for an if-then-endif. Note: the BASICO source code as comment is below the assembler code

        movl    a, %eax
        andl    %eax, %eax      #then_jump 0
        jz      .L1
#   if a then
        call    aNotEqual0
#     aNotEqual0()
.L1:                            #endif_label 0
#   endif

A two-way selection with if-then-else-endif gives the following assembler code with L1=1 and L2=2:

        movl    a, %eax
        andl    %eax, %eax      #then_jump 0
        jz      .L1
#   if a then
        call    aNotEqual0
#     aNotEqual0()
        jmp     .L2             #else_jump 0
#   else
        call    aEqual0
#     aEqual0()   
.L2:                            #endif_label 0
#   endif

statement while

 766        Get()
 767        whilelevel = whilelevel + 1
 768        L3[whilelevel] = label; label = label + 1
 769        L4[whilelevel] = label; label = label + 1
 770        printf(".L%d:", L3[whilelevel])
 771        printf("\t\t\t\t#while_label %d\n", whilelevel)
 772        expr(); Expect('d')
 773        printf("\tandl\t%%e%cx, %%e%cx", reg[0], reg[0])
 774        printf("\t#do_jump %d\n", whilelevel)
 775        printf("\tjz\t.L%d\n", L4[whilelevel])
 776        cacheinit()
 777        stmtList(); Expect('w')
 778        printf("\tjmp\t.L%d", L3[whilelevel])
 779        printf("\t\t#wend_jump %d\n", whilelevel)
 780        printf(".L%d:\n", L4[whilelevel])
 781        whilelevel = whilelevel - 1

The while command needs two jump labels, too. One conditional jump (L4) to the wend if the while condition fails. One unconditional jump (L3) from the end of the loop back to the start of the loop. Because the break statement has to know the jump label of the while loop, the jump labels are in a compiler maintained stack. The stack-index is variable whilelevel. In line this variable is incremented (add a new while level), in line 781 it is decremented (finish the while level).

statement return

 783        Get();
 784        if in(lookahead, "xnsc(-~") then
 785          expr()
 786          // return value in reg eax
 787          if reg[0] # 'a' then
 788            printf("\tmovl   %%e%cx, %%eax\n", reg[0])
 789          endif
 790        endif
 791        printf("\tleave\n")
 792        printf("\tret\n")

The return statement always emits the two opcode in lines 791 and 792. The leave opcode frees the local variable space by correcting the CPU stack pointer value and pops the frame pointer. The opcode in line 788 is only emitted if there is a return value and the return value is not already in the %eax CPU register, the return register of C library functions.

statement break

 794        Get()
 795        if whilelevel < 0 then
 796          error("break without while")
 797        endif
 798        printf("\tjmp\t.L%d", L4[whilelevel])
 799        printf("\t#break_jump %d\n", whilelevel)

The break statement is the most simple. Without the "break without while" check it is just line 798. A jump to the wend of the innermost while loop.

Next chapter BASICO source code reading - Parser 2
Previous chapter BASICO source code reading - Scanner
Back to BASICO title page