BASICO compiler source code reading - Parser 2
Author: Andre Adrian
Version: 09.jul.2006
The functions of this
section produce most of the assembler
code. The condition,
arithmetric and logic operations in expr(), addexpr(), term(),
unaryfact() and fact() produce "the working" opcodes like add and mul
and a lot of transport opcodes (mov).
func gettype()
294 func
gettype(offset: array int, name: array char): int
295 var
id: int
296 begin
297
id = lookup(locals, name)
298
if id >= 0 then
299
offset[0] = localoffset[id]
300
return localtype[id]
301
endif
302
id = lookup(globals, name)
303
if -1 = id then
304
error("variable not found")
305
endif
306
return globaltype[id]
307 end
Function gettype() has two output values. The variable type is returned
through the return value. The variable offset is returned through the
call-by-reference array variable offset. The first lookup() for local
variables is in line 297. The second lookup() for global variables is
in line 302. This implements the expected local variables are used
instead of global variables in case of identical name.
The variable type is coded into a single-character-representation. The
coding is:
global integer |
global integer array |
local integer |
local integer array |
call-by-reference integer array |
i
|
I
|
j
|
J
|
K
|
global char |
global char array |
local char |
local char array |
call-by-reference char array |
c
|
C
|
d
|
D
|
E
|
func assign()
434 func
assign(): void
435 begin
436
if 'i' = typevar then
437
printf("\tmovl\t%%e%cx, %s\n", reg[0], idvar)
438
else if 'j' = typevar then
439
printf("\tmovl\t%%e%cx, %d(%%ebp)\n", reg[0],
offsetvar[0])
440
else if 'I' = typevar then
441
printf("\tmovl\t%%e%cx, %s(,%%e%cx,4)\n", reg[0],
idvar, reg[1])
442
pop()
443
else if 'J' = typevar then
444
printf("\tmovl\t%%e%cx, %d(%%ebp,%%e%cx,4)\n",
reg[0], offsetvar[0], reg[1])
445
pop()
446
else if 'K' = typevar then
447
printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
448
printf("\tmovl\t%%e%cx, (%%esi,%%e%cx,4)\n", reg[0],
reg[1])
449
pop()
450
else if 'c' = typevar then
451
printf("\tmovb\t%%%cl, %s\n", reg[0], idvar)
452
else if 'd' = typevar then
453
printf("\tmovb\t%%%cl, %d(%%ebp)\n", reg[0],
offsetvar[0])
454
else if 'C' = typevar then
455
printf("\tmovb\t%%%cl, %s(%%e%cx)\n", reg[0], idvar,
reg[1])
456
pop()
457
else if 'D' = typevar then
458
printf("\tmovb\t%%%cl, %d(%%e%cx,%%ebp)\n", reg[0],
offsetvar[0], reg[1])
459
pop()
460
else if 'E' = typevar then
461
printf("\tmovl\t%d(%%ebp), %%esi\n", offsetvar[0])
462
printf("\tmovb\t%%%cl, (%%esi,%%e%cx)\n", reg[0],
reg[1])
463
pop()
464
endif; endif; endif; endif; endif; endif; endif; endif; endif;
endif
465
pop()
466 end
Function assign() selects the correct transport opcode depending on
variable type. The 80386 transport opcodes are quite powerful. The
base_of_array + size_of_array_element * array_index calculation is done
as part of the store mov opcode. Only in case of call-by-reference
assignment, we need two opcodes and the CPU register %esi to done the
indirect store.
func pop()
413 func
pop(): void
414 var
i: int
415
tmp: char
416 begin
417
se = se - 1
418
if se = 0 then
419
cacheinit()
420
else
421
i = 0
422
tmp = reg[i]
423
while i < 4 - 1 do
424
reg[i] = reg[i + 1]
425
i = i + 1
426
wend
427
reg[i] = tmp
428
endif
429
if se >= 4 then
430
printf("\tpopl\t%%e%cx\n", reg[i])
431
endif
432 end
The function pop() is one part of the stack-cache implementation, the
other part is function push(). The real CPU stack is used with the popl
opcode in line 430 if more then 4 stack elements (variable se) are
needed. If pop() is called, the old top-of-stack element is dropped and
the fourth-of-stack element is loaded from real stack to stack cache.
func callid()
359 func
callid(): void
360 var
i: int
361 begin
362
i = se
363
while i >= 1 do
364
i = i - 1
365
printf("\tpushl\t%%e%cx\n", reg[i])
366
wend
367
oldse = se
368
se = 0
369
strcpy(idcall, token)
370
args = 0
371 end
In the BASICO statement a = b + c(x, y) the stack cache is used for two
purposes: First to code the assignment and the expression (the
statement fragment a = b + ). Second to handle the function arguments
of c(x, y). Function callid() pushes the stack cache to the "real"
stack at the moment the parser sees c(, a function. The stack elements
se is stored in variable oldse, se is set to 0 to init se for the
second purpose. This implementation can not handle the following
statements:
srand(time(0)) use the time(0) function
return value as argument for function srand()
a = c(x, y) + d(x, y, z) handle two
functions with different number of arguments.
func call()
381 func
call(): void
382 var
i: int
383 begin
384
// C pushes args from right to left
385
i = 0
386
while i < se do
387
printf("\tpushl\t%%e%cx\n", reg[i])
388
i = i + 1
389
wend
390
printf("\tcall\t%s\n", idcall)
391
if args > 0 then
392
printf("\taddl\t$%d, %%esp\n", 4 * args)
393
endif
394 end
The C library functions expect the function arguments on the stack in
the order that the left-most argument is the top-of-stack argument. The
arguments have to be pushed right-to-left on the stack. This requests
normally a two-pass parsing of the function arguments. First pass to
get the number of arguments, second pass to produce the assembler code
with the proper stack locations of the arguments.
The BASICO compiler uses the stack-cache to reverse the arguments.
After parsing the arguments left-to-right the rightmost argument is
top-of-stack. The while loop in lines 385 to 389 pushes the
top-of-stack-cache first (rightmost argument) and the
fourth-of-stack-cache last (leftmost argument). This trick avoids two
pass parsing of function arguments, but limits the number of function
arguments to the number of CPU registers in the stack-cache, which is 4.
func exprList()
exprList =
'(' [ expr { ',' expr } ]
')'
.
Valid examples for exprList EBNF are:
'(' ')'
'(' a ')'
'(' a, b ')'
829 func
exprList(): void
830 begin
831
Expect('(')
832
if in(lookahead, "xnsc(-~") then
833
expr(); callarg()
834
while lookahead = ',' do
835
Get(); expr(); callarg()
836
wend
837
endif
838
Expect(')')
839 end
The first set for exprList is IDENT, CONST, STRCONST, CHRCONST, '(',
'-' and '~'.
func callarg()
373 func
callarg(): void
374 begin
375
args = args + 1
376
if args > 4 then
377
error("too many args")
378
endif
379 end
Function callarg() increments the number of function arguments in line
375. Because we have only 4 CPU registers for the stack-cache, more
then 4 function arguments are not allowed. See functions call() and
callid() for details.
func expr()
expr =
addexpr { '=' addexpr
| '#' addexpr
| '<' [ '=' ] addexpr
| '>' [ '=' ] addexpr
} .
The BASICO expr() is called condition or conditional_expression in
other compilers. A difference to standard compilers is the handling of
"<=" and
">=" tokens. In most compilers the scanner detects these 2-character
tokens. Like in Crenshaw's TINY the BASICO scanner detects two tokens:
a '>' and a '=' token for ">=". The BASICO parser has to
understand this.
The TINY approach simplifies the scanner.
The disadvantage is not so obvious: it is possible to write white space
between '>' and '='.
803 func
expr(): void
804 begin
805
addexpr()
806
while in(lookahead, "=#<>") do
807
if lookahead = '=' then
808
Get(); addexpr(); rel("equal", "sete")
809
else if lookahead = '#' then
810
Get(); addexpr(); rel("not equal",
"setne")
811
else if lookahead = '<' then
812
Get()
813
if lookahead = '=' then
814
Get(); addexpr(); rel("less
or equal", "setle")
815
else
816
addexpr(); rel("less",
"setl")
817
endif
818
else
819
Get()
820
if lookahead = '=' then
821
Get(); addexpr();
rel("greater or equal", "setge")
822
else
823
addexpr(); rel("greater",
"setg")
824
endif
825
endif; endif; endif
826
wend
827 end
The "working opcodes" in function expr() are create with the function
rel().
func rel()
535 func
rel(comment: array char, opcode: array char): void
536 begin
537
printf("\tcmpl\t%%e%cx, %%e%cx", reg[0], reg[1])
538
printf("\t#%s\n", comment)
539
printf("\t%s\t%%%cl\n", opcode, reg[1])
540
printf("\tmovzbl\t%%%cl, %%e%cx\n", reg[1], reg[1])
541
pop()
542 end
Function rel() is a functional macro. The function arguments comment
and opcode are build in a relational template. The assembler code is a
good example about CPU evolution (yes, evolution is everywhere, even in
CPU opcodes). In the old days, only short-circuit evaluation of
conditions was done, after the compare opcode cmpl there was a
conditional jump like jz for jump if zero. In BASICO compiler we set
integer variables depending on the outcome of compare. The setz opcode
sets the register to 1 if the compare was zero. Because only the lower
8 bits of the register are set by setz opcode we need the movzbl 8-bit
to 32-bit extend opcode.
func addexpr()
addexpr =
term { '+' term
| '-' term
| '|' term
| '^' term
} .
Now we come to the "working" opcodes. This is the level of assembler
programmers: addl, subl, orl and xorl for signed addition, signed
subtraction, binary inclusive or and binary exclusive or.
841 func
addexpr(): void
842 begin
843
term()
844
while in(lookahead, "+-|^") do
845
if lookahead = '+' then
846
Get(); term()
847
printf("\taddl\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
848
pop()
849
else if lookahead = '-' then
850
Get(); term()
851
printf("\tsubl\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
852
pop()
853
else if lookahead = '|' then
854
Get(); term()
855
printf("\torl\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
856
pop()
857
else
858
Get(); term()
859
printf("\txorl\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
860
pop()
861
endif ; endif ; endif
862
wend
863 end
func term()
term =
unaryfact { '*' unaryfact
| '/' unaryfact
| '%' unaryfact
| '&' unaryfact
} .
More working opcodes. The opcodes imul for signed multiplication and
andl for binary and are simple. The division and remainder opcodes are
created in the divmod() function. This is due to the
"non-orthogonality" of the 80386 opcodes. Orthogonal for a CPU is, that
registers are exchangeable. "All general registers are equal". In most
real (CISC) CPUs some general registers are more equal. See divmod()
for details.
865 func
term(): void
866 begin
867
unaryfact()
868
while in(lookahead, "*/%&") do
869
if lookahead = '*' then
870
Get(); unaryfact()
871
printf("\timul\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
872
pop()
873
else if lookahead = '/' then
874
Get(); unaryfact()
875
divmod()
876
printf("\tmovl\t%%esi,
%%edx\n"); // restore edx
877
pop()
878
else if lookahead = '%' then
879
Get(); unaryfact()
880
divmod()
881
printf("\tmovl\t%%edx,
%%eax\n"); // use remainder
882
printf("\tmovl\t%%esi,
%%edx\n"); // restore edx
883
pop()
884
else
885
Get(); unaryfact()
886
printf("\tandl\t%%e%cx, %%e%cx\n",
reg[0], reg[1])
887
pop()
888
endif ; endif ; endif
889
wend
890 end
func divmod()
513 func
divmod(): void
514 var
tmp: int
515
i: int
516 begin
517
// idiv op-code uses edx and eax for 64bit dividend at input
518
// output is quotient in eax, remainder in edx
519
i = 0
520
while i < 4 do
521
if (i # 1) & ('a' = reg[i]) then
522
printf("\txchgl\t%%e%cx, %%e%cx\n",
reg[i], reg[1])
523
tmp = reg[i]
524
reg[i] = reg[1]
525
reg[1] = tmp
526
break
527
endif
528
i = i + 1
529
wend
530
printf("\tmovl\t%%edx, %%esi\n"); //
save edx
531
printf("\tcltd\n")
532
printf("\tidivl\t%%e%cx\n", reg[0])
533 end
The 80386 CPU has some funny ideas about register use for division. It
expects the upper 32 bits of the dividend in register %edx. The lower
32bits bits of the dividend have to be in register %eax. The 32bit
divisor can by in any general register, how nice! The quotient is again
in register %eax, the remainder is in %edx. Because the BASICO compiler
is very simple, the registers get loaded this way: The if command in
lines 521 to 527 exchanges the register %eax with another register to
bring the dividend into register %eax. Then %eax gets sign extended to
64 bits into %edx register with the cltd opcode in line 531. Before
sign extend the register contents of %edx is saved in register %esi.
func unaryfact()
unaryfact =
'-'
fact
| '~'
fact
| fact
.
The unary operator are '-' for minus or 2-ers complement and '~' for
bit reversal or 1-ers complement.
892 func
unaryfact(): void
893 begin
894
if lookahead = '-' then
895
Get(); fact()
896
printf("\tnegl\t%%e%cx\n", reg[0])
897
else if lookahead = '~' then
898
Get(); fact()
899
printf("\tnotl\t%%e%cx\n", reg[0])
900
else if in(lookahead, "xnsc(-~") then
901
fact()
902
else; error("bad unaryfact"); endif; endif; endif
903 end
func fact()
fact =
IDENT [ '[' expr ']' |
exprList ]
| CONST
|
CHRCONST
|
STRCONST
| '(' expr ')'
.
Function fact() is the third longest function in the compiler. Again
the different cases work quite independent.
905 func
fact(): void
906 var
i: int
907 begin
908
if lookahead = 'x' then
927
else if lookahead = 'n' then
931
else if lookahead = 'c' then
940
else if lookahead = 's' then
949
else if lookahead = '(' then
951
else error("bad fact"); endif; endif; endif; endif; endif
952 end
fact IDENT
909
Get()
910
if (lookahead = '(') | (lookahead = '[') then
911
if (lookahead = '[') then
912
strcpy(idarray, token);
Get(); expr(); Expect(']'); pusharrayvar()
913
else
914
callid(); exprList()
915
call()
916
se = oldse + 1
917
strcpy(reg,
"abcd"); //
return value in eax
918
i = 1
919
while i < se do
920
printf("\tpopl\t%%e%cx\n", reg[i])
921
i = i + 1
922
wend
923
endif
924
else
925
pushvar()
926
endif
fact CONST
928
Get()
929
push()
930
printf("\tmovl\t$%d, %%e%cx\n", number, reg[0])
Loading an integer literal is a simple load constant to register opcode.
fact CHRCONST
932
Get()
933
push()
934
printf("\tmovl\t$%d, %%e%cx", token[0], reg[0])
935
if token[0] >= ' ' then
936
printf("\t# '%c'\n", token[0])
937
else
938
printf("\n")
939
endif
Loading a character constant is a simple mov opcode in line 934. The if
command in lines 935 to 939 write a little comment to the assmbler file
to show the ASCII character if it is a printable ASCII character.
fact STRCONST
941
Get()
942
printf("\t.section .rodata\n")
943
printf(".LC%d:\n", lco)
944
printf("\t.string\t%s\n", token)
945
printf("\t.text\n")
946
push()
947
printf("\tmovl\t$.LC%d, %%e%cx\n", lco, reg[0])
948
lco = lco + 1
Loading a string constant is lengthy. The assembler needs a section
pseudo-opcode (line 942) and a string constant label (line 943). The
real opcode is again a simple move.
fact (expr)
950
Get(); expr(); Expect(')')
This is the recursive call to expr(). Without this statement BASICO and
any other high level language would be less powerful. Even FORTRAN, a
programming language that does not offer recursion needs recursion, or
a recursion emulation, for expression
evaluation.
func push()
396 func
push(): void
397 var
i: int
398
tmp: char
399 begin
400
i = 4 - 1
401
if se >= 4 then
402
printf("\tpushl\t%%e%cx\n", reg[i])
403
endif
404
tmp = reg[i]
405
while i > 0 do
406
reg[i] = reg[i - 1]
407
i = i - 1
408
wend
409
reg[0] = tmp
410
se = se + 1
411 end
The function push() is one part of the stack-cache implementation, the
other part is function pop(). The real CPU stack is used with the pushl
opcode in line 402 if more then 4 stack elements (variable se) are
needed. The fourth-of-stack element is pushed and becomes the new
top-of stack element.
func pushvar()
468 func
pushvar(): void
469 var
type: int
470
offset: array 1
int
// array for call-by-reference
471 begin
472
type = gettype(offset, token)
473
push()
474
if 'i' = type then
475
printf("\tmovl\t%s, %%e%cx\n", token, reg[0])
476
else if ('I' = type ) | ( 'C' = type) then
477
printf("\tmovl\t$%s, %%e%cx\n", token, reg[0])
478
else if 'j' = type then
479
printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0],
reg[0])
480
else if ('J' = type ) | ( 'D' = type) then
481
printf("\tleal\t%d(%%ebp), %%e%cx\n", offset[0],
reg[0])
482
else if 'c' = type then
483
printf("\tmovzbl\t%s, %%e%cx\n", token, reg[0])
484
else if 'd' = type then
485
printf("\tmovzbl\t%d(%%ebp), %%e%cx\n", offset[0],
reg[0])
486
else if ('K' = type ) | ( 'E' = type) then
487
printf("\tmovl\t%d(%%ebp), %%e%cx\n", offset[0],
reg[0])
488
endif; endif; endif; endif; endif; endif; endif
489 end
Function pushvar() is the opposite of assign(). It gets the values from
memory into the stack-cache, the general purpose registers of the CPU.
func pusharrayvar()
491 func
pusharrayvar(): void
492 var
type: int
493
offset: array 1
int
// array for call-by-reference
494 begin
495
type = gettype(offset, idarray)
496
if 'I' = type then
497
printf("\tmovl\t%s(,%%e%cx,4), %%e%cx\n", idarray,
reg[0], reg[0])
498
else if 'J' = type then
499
printf("\tmovl\t%d(%%ebp,%%e%cx,4), %%e%cx\n",
offset[0], reg[0], reg[0])
500
else if 'K' = type then
501
printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
502
printf("\tmovl\t(%%esi,%%e%cx,4), %%e%cx\n", reg[0],
reg[0])
503
else if 'C' = type then
504
printf("\tmovzbl\t%s(%%e%cx), %%e%cx\n", idarray,
reg[0], reg[0])
505
else if 'D' = type then
506
printf("\tmovzbl\t%d(%%e%cx,%%ebp), %%e%cx\n",
offset[0], reg[0], reg[0])
507
else if 'E' = type then
508
printf("\tmovl\t%d(%%ebp), %%esi\n", offset[0])
509
printf("\tmovzbl\t(%%esi,%%e%cx), %%e%cx\n", reg[0],
reg[0])
510
endif; endif; endif; endif; endif; endif
511 end
Function pusharrayvar() is the opposite of assign(). It gets the values
from
memory into the stack-cache, the general purpose registers of the CPU.
Conclusions
Writing the BASICO compiler was fun. Writing the documentation was not
so much fun. Still I hope that this source code documentation shows
that source code reading is possible. Some computer languages (Lisp,
Forth) allow to create your own control structures, which makes them
"write-only" if the programmers get lost in compiler extension dreams.
Others have such a big, big, big syntax, semantics and library (C, C++,
C#, JAVA) that a new programmer needs forever to learn all ways how to
do things and stays a beginner reader for a long time.
BASICO is a very simple language. Reading and writing BASICO source
code should be easy. And even if you never intend to write a program in
BASICO more complex then "Hello BASICO", the BASICO compiler shows you
how your source code is translated into assembler code.
1000 source code lines are read. That's all. Goodbye!
Previous chapter BASICO source code reading
- Parser 1
Back to BASICO title page