BASICO compiler source code reading - Parser 1
Author: Andre Adrian
Version: 02.jul.2006
This part handles the command parser, the next part covers the
expression parser. The command parser implements the variable
declaration, function definition and the control flow statements. The
expression parser handles the unary and binary operators like ~, -, +,
* and the variable load and store.
func basico()
Before we talk about the statements in the basico() function, let's
look
at the Extended Backus-Naur-Form (EBNF) grammar of basico():
basico =
[ "var" { globalDecl ';' } ]
{ "func" IDENT '(' paramList ')' ':' returnType ';'
blockOrForward ';' }
.
The basico definition starts with an optional "var"
statement. The meta-symbol [ ] is 0 to 1 repetition of the enclosed
pattern. After the reserved word "var" there can be 0 or more
globalDecl that are followed each by an ';'. The meta-symbol { } is 0
to N
repetitions. Literal symbols like "var" and ';' are enclosed in " " or
' ' and stay for themselves. The word globalDecl is a definition, like
basico is. The most simple example for a "var" statement is the empty
string, just write nothing. Other examples are:
var i:int;
var i:int; k:char; m:array 10
char;
var i:int;
k:char;
m:array 10
char;
Again, in BASICO the scanner changes linefeed to ';' if necessary. Most
of the ';' we don't have to type and we should not type in an actual
BASICO source code file.
Second part of the basico definition is the 0 to N times repetition of
a "func", a function. The most simple BASICO program is the empty
string, because the "var" and the "func" part are both valid with 0
times repetition. But now a closer look at a function.
A function starts with the reserved word func. The " " around func just
says: I am a literal. Second item is IDENT. See section Identifier for
the definition of IDENT. The third item in the func grammar is
the
character (. The ' ' around (
says again: I am a literal. The fourth item is paramList. Because it is
no literal, it has to be explained somewhere else in the grammar.
We look at this later. More important for the moment is to recognize
the pattern in our EBNF. The EBNF of func is a template with literals
and fill-ins. Some fill-ins are simple regular expressions like IDENT,
other
fill-ins are EBNFs. These other EBNFs have to come down to regular
expressions and regular expressions have to come down to ASCII
characters. In
the end everything can be expressed with ASCII characters. To translate
from
one language to another, the compiler has to do two jobs: first analyse
the source code, then synthesize the assembler code. The analysis is
controlled by the syntax, the EBNFs and REs of the programming
language. The synthesis just follows syntax. We speak of "syntax driven
translation". The analysis of BASICO is simple, because the grammar is
context-free and predictive. Our human languages are not context-free
nor predictive. This is the reason why machine translation is working
great in the area of programming languages and is working poor in the
area of human languages.
560 func
basico(): void
561 begin
562 if lookahead = 'v' then
563 Get()
564 while lookahead = 'x'
do // globalDecl first
565 globalDecl()
566
Expect(';')
// globalDecl follow
567 wend
568 endif
569 while lookahead = 'F' do
570 Get(); Expect('x')
571 strcpy(idfunc, token)
572 frame = 0
573 param = 8
574 Expect('(');
paramList(); Expect(')')
575 Expect(':');
returnType(); Expect(';')
576 blockOrForward();
Expect(';')
577 wend
578 end
The EBNF option meta-symbols [ ] are translated into an if command in
the
source code of a parser. The repetition meta-symbols { } are translated
to a while command. The EBNF [ "var" { globalDecl ';' } ] is coded in
lines 562 to 568. A EBNF definition like globalDecl is translated into
a function call. The functions Get() and Expect() are explained next.
Line 562 is easy. The one-character-representation of "var" is 'v', and
the line is just the necessary if to realize an EBNF option. Line 564
has the while statement that we need to implement a repetition. The
compare in line 564 is lookahead = 'x', which we can convert to
lookahead = IDENT. In the EBNF of "var" there is on IDENT at this
location, we read globalDecl.
One of the tricky parts of translation a EBNF into the source code of a
parser is to find the first sets. What is a first set? Well,
the source code of our parser can not compare the lookahead to some
abstract entity like globalDecl. The parser can only compare to some
token. The first set are just all tokens that can start a globalDecl. A
compiler compiler like CoCo/R has an algorithm to find the first sets
in the grammar.
If you like to do it by hand you have to read down the EBNFs to find
the real tokens. The EBNF [ "var" { globalDecl ';' } ] uses the
globalDecl EBNF. The globalDecl definition is:
globalDecl =
IDENT ':' (
"int"
| "char"
| "array"
CONST (
"int"
|
"char"
)
) .
The only start token for globalDecl is IDENT. Therefore the first set
of globalDecl is < IDENT >. The < > are set brackets.
Line 569 to 577 codes the EBNF { "func" IDENT '(' paramList ')' ':'
returnType ';' blockOrForward ';' }. Line 569 is the while statement to
realize the { } repetion. The Expect('x') in line 570 reads the IDENT.
Lines 571 to 573 are part of the synthesis, the production of assembler
code.
Line 571 stores the IDENT attribute (variable name) into idfunc. Line
572 sets variable frame to 0. Variable frame contains the memory size
for local variables. Every local variable that is found while parsing
the function increases frame.
44
frame:
int
// locals
Line 573 inits variable param to 8. Variable param contains the offset
for the function arguments. The first argument has an offset of 8. This
value is CPU specific. On the 80386 CPU a function call of the function
x(a, b, c) has the following pseudo code sequence:
push argument c
push argument b
push argument a
push return address, jump to x
push frame pointer
Every push needs 4 bytes (one 32bit word) on the stack. Because the
stack grows down, that is every push decrements the stack pointer, the
arguments have a positive offset. The argument a is found at
stackpointer+8 location, the argument b is found at stackpointer+12
location.
Lines 574 to 576 codes the EBNF part '(' paramList ')' ':' returnType
';' blockOrForward ';' which does not contain new features. The EBNF
definitions paramList, returnType and blockOrForward are realized as
function calls. The literals '(', ')', ':' and ';' are read with
Expect().
func Get()
546 func
Get(): void
547 var
s: array 16 char
548 begin
549
lookahead = yylex()
550 end
Function Get() gets one token from yylex() and stores this token in
variable lookahead. The local variable s is not needed. This is one of
the typical left-overs from a copy/paste/modify session. We use Get()
if it is not necessary to check the token we eat again.
func Expect()
552 func
Expect(t: int): void
553 var
s: array 16 char
554 begin
555
if lookahead = t then
556
Get()
557
else sprintf(s, "%c expected", t); error(s); endif
558 end
Function Expect() eats a token after a check. If the check fails, the
compiler is terminated with a "XXX expected" message. The expected
token is given as argument. The actual token out of the source code
file is in variable lookahead. The call to Get() eats the token and
loads a new token into variable lookahead.
func globalDecl()
The EBNF of globalDecl has two new meta-symbols. The ( ) brackets
enclose a group. A group is an un-named EBNF definition. The |
meta-synbols is alternative.
globalDecl =
IDENT ':' (
"int"
| "char"
| "array" CONST (
"int"
| "char"
)
) .
The EBNF without ( ) meta-symbols is:
globalDecl = IDENT ':'
globalDecl1 .
globalDecl1 = "int" | "char" | "array" CONST globalDecl2 .
globalDecl2 = "int" | "char" .
Both EBNFs are equal. The second version is easier to read.
Valid examples for globalDecl are:
a: int
b: char
c: array 10 int
d: array 20 char
As synthesis action the variable declaration has to reserve memory in
global memory space in the assembler file with the name of the IDENT.
Another synthesis action is to remember the type of the variable. An
integer variable needs other assembler op-codes for load and store then
a character or array variable, so we need the type information later on
in variable access, variable assignment and use of variable as
parameter in a function call.
580 func
globalDecl(): void
581 begin
582
Expect('x'); strcpy(iddecl, token)
583
Expect(':')
// syntactic sugar
584
if lookahead = 'I' then
585
Get()
586
printf("\t.comm %s,4,4\n", iddecl)
587
addglobal(iddecl, 'i')
588
else if lookahead = 'C' then
589
Get()
590
printf("\t.comm %s,1,1\n", iddecl)
591
addglobal(iddecl, 'c')
592
else if lookahead = 'a' then
593
Get(); Expect('n')
594
if lookahead = 'I' then
595
Get()
596
printf("\t.comm %s,%d,4\n",
iddecl, number * 4)
597
addglobal(iddecl, 'I')
598
else if lookahead = 'C' then
599
Get()
600
printf("\t.comm %s,%d,1\n",
iddecl, number)
601
addglobal(iddecl, 'C')
602
else; error("int, char expected"); endif; endif
603
else; error("int, char, array expected"); endif; endif; endif
604 end
Line 582 and 583 eat the IDENT ':' part of the globalDecl EBNF. As
synthesis action the IDENT attribute token (the variable name) is
stored in iddecl in line 582. The EBNF alternative is realized as an
if-then-else-if-then-else...endif-endif construct. The lines 584, 588
and 592 start the alternatives "int", "char" and "array". For
alternative "int" line 585 eats the "int" token. Line 586 produces the
memory reservation in the assembler file with the C library function
printf(). Line 587 calls addglobal() to save the variable name together
with variable type for later use.
Lines 589 to 591 do the same for character what lines 585 to 587 do for
integer.
Line 593 eats the "array" token and eats the interger literal for array
size. After Expect('n') the integer literal is in variable number. In
line 596 variable number is used to calculate the memory size for the
integer array. Line 597 calls addglobal() with variable name and
variable type, but without array size. A PASCAL compiler would save the
array size for all the automatic buffer overwrite checks it wants to
do. A C compiler would save it for the sizeof() compiler function which
tells the size of the argument.
Line 602 contains a call to error() that is executed if the parser does
not find the expected "int" or "char" as last tokens of an array
variable declaration. Line 603 contains an error() call that is
triggered if after the ':' the parser does not find "int", "char" or
"array".
func addglobal()
326 func
addglobal(name: array char, type: char): int
327 var
id: int
328 begin
329
id = newname(globals, name)
330
globaltype[id] = type
331
return id
332 end
Function addglobal() is an enriche/modify function. The real work is
done in newname(). Function addglobal() selects the variable globals as
symbol table for newname(), which make sense because in variable
globals we store the names of all global variables. In line 330 the
return value of newname is used as index for the array globaltype. The
type information that is argument type to function addlocal() is stored
in globaltype under the proper index. Line 331 returns with id.
func newname()
309 func
newname(symtab: array char, name: array char): int
310 var
ident: int
311 begin
312
ident = lookup(symtab, name)
313
if ident >= 0 then
314
return ident
315
else
316
if strlen(symtab) + strlen(name) + 2 <= 1000 then
317
strcat(symtab, name)
318
strcat(symtab, " ")
319
else
320
error("symtab full")
321
endif
322
return lookup(symtab, name)
323
endif
324 end
Function newname() adds a new variable name to the symbol table symtab.
Before the add, the symbol table is checked with lookup() if this
variable name does not exist already. In BASICO multiple declaration is
forbidden. It is allowed to have a global variable i and a local
variable i. But a global variable i that is first declared as an
integer and then declared as an character does not make sense.
Unfortunely the current newname() implementation does not support this
behaviour. The lookup() check is done in line 312, but instead of an
error("multiple declaration") in line 314 there is a return ident. This
is a real bug in version 0.8 of BASICO.
Lines 316 to 321 concatenate the name with a trailing space to the
symbol table, or terminate with error("symtab full"). The C library
function strlen() returns the string length without the NUL character.
The string "abc" needs 4 characters for store, but strlen() will return
3.
Line 322 calls lookup() again to get the index of the just added name
in the symbol table. This call to lookup() can be eliminated with two
variables globalid and localid that get incremented in addglobal() and
addlocal(). This is an example of the famous speed/memory trade-off: We
spend time to run a function like lookup() to calculate some
information or we use some variables like globalid and localid to have
the information already present. Still the variables have to be updated
and this is the nasty part of all the use memory to improve speed
optimizations: often the variables get not updated in all cases and run
out of synchronization.
The BASICO compiler did not run faster with the globalid, localid
optimizations, so the optimizations were thrown out again. An
optimization that does not save time is only an unnecessary
complication of the source code. Always say to your source code: Earn
your money or leave my file!
Note: If your payment is on a lines of code basis, this advice is bad
for you. If you code for high Reliability/Maintainability/Availability
(RMA) this is the best advice you ever get.
func paramList()
paramList =
[ paramDecl { ','
paramDecl } ]
.
606 func
paramList(): void
607 begin
608
if lookahead = 'x' then
609
paramDecl()
610
while lookahead = ',' do
611
Get()
612
paramDecl()
613
wend
614
endif
615 end
func paramDecl()
paramDecl =
IDENT ':' (
"int"
| "char"
| "array" (
"int"
|
"char"
)
) .
637 func
paramDecl(): void
638 begin
639
Expect('x'); strcpy(iddecl, token)
640
Expect(':')
641
if lookahead = 'I' then
642
Get()
643
addlocal(iddecl, 'j', param)
644
param = param + 4
645
else if lookahead = 'C' then
646
Get()
647
addlocal(iddecl, 'd', param)
648
param = param + 4
649
else if lookahead = 'a' then
650
Get()
651
if lookahead = 'I' then
652
Get()
653
addlocal(iddecl, 'K', param)
654
param = param + 4
655
else if lookahead = 'C' then
656
Get()
657
addlocal(iddecl, 'E', param)
658
param = param + 4
659
else; error("int, char expected"); endif; endif
660
else; error("int, char, array expected"); endif; endif; endif
661 end
func returnType()
returnType =
"int"
|
"char"
|
"void"
.
617 func
returnType(): void
618 begin
619
if lookahead = 'I' then
620
Get(); printf("\t\t\t\t#func returns int\n")
621
else if lookahead = 'C' then
622
Get(); printf("\t\t\t\t#func returns char\n")
623
else if lookahead = 'V' then
624
Get(); printf("\t\t\t\t#func returns void\n")
625
else; error("bad returnType"); endif; endif; endif
626 end
func blockOrForward()
blockOrForward =
pblock3
| "forward"
.
628 func
blockOrForward(): void
629 begin
630
if (lookahead = 'v') | (lookahead = 'b') then
631
pblock3()
632
else if lookahead = 'f' then
633
Get(); printf("\t\t\t\t#forward decl\n")
634
else error("bad blockOrForward"); endif; endif
635 end
func pblock3()
pblock3 =
[ "var" { localDecl ';' } ]
"begin" stmtList "end"
.
663 func
pblock3(): void
664 begin
665
if lookahead = 'v' then
666
Get();
667
while lookahead = 'x' do // globalDecl first
668
localDecl()
669
Expect(';')
// globalDecl follow
670
wend
671
endif
672
Expect('b');
673
cacheinit()
674
printf(".globl\t%s\n", idfunc)
675
printf("\t.type\t%s, @function\n", idfunc)
676
printf("%s:\n", idfunc)
677
printf("\tpushl\t%%ebp\n")
678
printf("\tmovl\t%%esp, %%ebp\n")
679
if frame < 0 then
680
printf("\tsubl\t$%d, %%esp\n", -frame)
681
endif
682
stmtList(); Expect('e')
683
printf("\tleave\n")
684
printf("\tret\n")
685
printf("\t.size %s, .-%s\n", idfunc, idfunc)
686
locals[0] = 0 // init locals
687 end
func localDecl()
localDecl =
IDENT ':' (
"int"
| "char"
| "array"
CONST (
"int"
|
"char"
)
) .
689 func
localDecl(): void
690 begin
691
Expect('x'); strcpy(iddecl, token)
692
Expect(':')
// syntactic sugar
693
if lookahead = 'I' then
694
Get()
695
frame = frame - 4
696
addlocal(iddecl, 'j', frame)
697
else if lookahead = 'C' then
698
Get()
699
frame = frame - 1
700
addlocal(iddecl, 'd', frame)
701
else if lookahead = 'a' then
702
Get(); Expect('n')
703
if lookahead = 'I' then
704
Get()
705
frame = frame - number * 4
706
addlocal(iddecl, 'J', frame)
707
else if lookahead = 'C' then
708
Get()
709
frame = frame - number
710
addlocal(iddecl, 'D', frame)
711
else; error("int, char expected"); endif; endif
712
else; error("int, char, array expected"); endif; endif; endif
713 end
func stmtList()
stmtList =
{ [ statement ] ';' }
.
Some valid examples for the stmtlist EBNF:
The empty set
';'
statement ';'
statement ';' statement ';'
715 func
stmtList(): void
716 begin
717
while in(lookahead, "x;iWrB") do
718
if in(lookahead, "xiWrB") then
719
statement()
720
endif
721
Expect(';')
722
wend
723 end
The first set for stmtList() is IDENT, ';' "if", "while", "return" and
"break".
func statement()
statement =
IDENT (
[ '[' expr ']'
] '=' expr
|
exprList
)
| "if" expr "then"
stmtList [ "else" stmtList ] "endif"
| "while" expr "do"
stmtList "wend"
| "return" [ expr
]
| "break"
.
Function statement() is the second longest function after yylex().
Again it can be split in several sections that work independent of each
other. First we show the control flow of statement(), then we show the
sections:
725 func
statement(): void
726 var
L1: int // if jump label
727
L2: int
728 begin
729
if lookahead = 'x' then
743
else if lookahead = 'i' then
765
else if lookahead = 'W' then
782
else if lookahead = 'r' then
793
else if lookahead = 'B' then
800
else error("bad statement"); endif; endif; endif; endif; endif
801 end
The control flow of statement() is simple. Just check the reserved
words in a if-then-else-if ... multiple selection.
statement IDENT
730
Get()
731
if (lookahead = '[') | (lookahead = '=') then
732
strcpy(idvar, token)
733
typevar = gettype(offsetvar, idvar)
734
if lookahead = '[' then
735
Get(); expr(); Expect(']')
736
endif
737
Expect('='); expr(); assign()
738
else if lookahead = '(' then
739
callid(); exprList()
740
call()
741
cacheinit()
742
else error("bad statement1"); endif; endif
Lines 731 to 737 handle the assignment to a variable case. Lines 738 to
741 handle the procedure call case.
statement if
744
Get(); expr(); Expect('t')
745
level = level + 1
746
L1 = label; label = label + 1
747
L2 = L1
748
printf("\tandl\t%%e%cx, %%e%cx", reg[0], reg[0])
749
printf("\t#then_jump %d\n", level)
750
printf("\tjz\t.L%d\n", L1)
751
cacheinit()
752
stmtList()
753
if lookahead = 'l' then
754
Get()
755
L2 = label; label = label + 1
756
printf("\tjmp\t.L%d", L2)
757
printf("\t\t#else_jump %d\n", level)
758
printf(".L%d:\n", L1)
759
stmtList()
760
endif
761
Expect('E')
762
printf(".L%d:", L2)
763
printf("\t\t\t\t#endif_label %d\n", level)
764
level = level - 1
The structured if command is translated into two jumps. The jump labels
(jump targets) are created in lines 746 and 747. At this moment the two
jump labels are equal. A if-then-endif command uses only one jump
label. The "jump if zero" opcode that is emitted in line 750 is the
conditional jump around the assembler code between the then and endif
block. A if-the-else-endif command uses two different jump labels. In
line 755 L2 is made different to L1. Now the jump to L1 is the jump
around the then-else block to the else-endif block. The last assembler
code instruction of the then-else block is the unconditional jump to
endif that is emitted in line 756. With L1=1 and L2=1 we get the
following assembler code for an if-then-endif. Note: the BASICO source
code as comment is below the assembler code
movl a, %eax
andl %eax, %eax
#then_jump 0
jz .L1
# if a then
call aNotEqual0
#
aNotEqual0()
.L1:
#endif_label 0
# endif
A two-way selection with if-then-else-endif gives the following
assembler code with L1=1 and L2=2:
movl a, %eax
andl %eax, %eax
#then_jump 0
jz .L1
# if a then
call aNotEqual0
#
aNotEqual0()
jmp
.L2
#else_jump 0
.L1:
# else
call aEqual0
#
aEqual0()
.L2:
#endif_label 0
# endif
statement while
766
Get()
767
whilelevel = whilelevel + 1
768
L3[whilelevel] = label; label = label + 1
769
L4[whilelevel] = label; label = label + 1
770
printf(".L%d:", L3[whilelevel])
771
printf("\t\t\t\t#while_label %d\n", whilelevel)
772
expr(); Expect('d')
773
printf("\tandl\t%%e%cx, %%e%cx", reg[0], reg[0])
774
printf("\t#do_jump %d\n", whilelevel)
775
printf("\tjz\t.L%d\n", L4[whilelevel])
776
cacheinit()
777
stmtList(); Expect('w')
778
printf("\tjmp\t.L%d", L3[whilelevel])
779
printf("\t\t#wend_jump %d\n", whilelevel)
780
printf(".L%d:\n", L4[whilelevel])
781
whilelevel = whilelevel - 1
The while command needs two jump labels, too. One conditional jump (L4)
to the wend if the while condition fails. One unconditional jump (L3)
from the end of the loop back to the start of the loop. Because the
break statement has to know the jump label of the while loop, the jump
labels are in a compiler maintained stack. The stack-index is variable
whilelevel. In line this variable is incremented (add a new while
level), in line 781 it is decremented (finish the while level).
statement return
783
Get();
784
if in(lookahead, "xnsc(-~") then
785
expr()
786
// return value in reg eax
787
if reg[0] # 'a' then
788
printf("\tmovl
%%e%cx, %%eax\n", reg[0])
789
endif
790
endif
791
printf("\tleave\n")
792
printf("\tret\n")
The return statement always emits the two opcode in lines 791 and 792.
The leave opcode frees the local variable space by correcting the CPU
stack pointer value and pops the frame pointer. The opcode in line 788
is only emitted if there is a return value and the return value is not
already in the %eax CPU register, the return register of C library
functions.
statement break
794
Get()
795
if whilelevel < 0 then
796
error("break without while")
797
endif
798
printf("\tjmp\t.L%d", L4[whilelevel])
799
printf("\t#break_jump %d\n", whilelevel)
The break statement is the most simple. Without the "break without
while" check it is just line 798. A jump to the wend of the innermost
while loop.
Next chapter BASICO source code reading - Parser 2
Previous chapter BASICO source code reading
- Scanner
Back to BASICO title page