BASICO - Programming Language, Self-compiling compiler

Author: Andre Adrian
Version: 22.apr.2008

Abstract

BASICO
Please check the section BASICO source code reading to get a detailed discussion of the source code!

Contents

Introduction

In January 2006 Dr. Dobb's journal celebrated the 30th anniversary of Tiny BASIC. Tiny BASIC was specified by Dennis Allison and was implemented by Dick Whipple and John Arnold. Tiny BASIC needed 3 Kilobytes of memory and was sold for 5 US-dollars. The Microsoft BASIC of this time needed 8 Kilobytes and 150 US-dollars.
The original Tiny BASIC interpreter is written in an intermediate language (IL) to save memory. The CPU runs a virtual machine (VM) to emulate the IL processor. The IL processor interprets the Tiny BASIC program. Li-Chen Wang created a Tiny BASIC version without IL for the Z80 8-bit CPU.
One drawback of this approach is that the IL can handle strings, because it has to parse the BASIC keywords like PRINT, GOTO, but the Tiny BASIC itself has no string handling commands. IL has small memory requirement but also slow execution speed.
Dennis Ritchie used  compiler bootstrapping for the C compiler, like Niklaus Wirth did for the computer languages PASCAL, Modula and Oberon. The compiler itself is written in the language the compiler can translate. After an first iteration that needs external help, the compiler of iteration N can compile the source code that creates compiler of iteration N+1.

The project is pure fun. It is just a "what if" you ask 30 years later. In 1976 the knowledge for BASICO was already present. The "goto considered harmful" paper of Edsger W. Dijkstra was written in 1968. The first PASCAL compiler was written in 1971. The UNIX developers used compiler bootstrapping in 1973. The compiler tools lex (Lesk, 1975) and yacc (Johnson, 1974) were available.
The word BASICO can be read as "Beginner's All-purpose Symbolic Instruction Code zero" or as Basico, just something with an italian sound like lambrusco.

Note: The project started with the name "Tiny BASIC no GOTO (TB-NG)", but BASICO sounds so much better. Basico is the name of a small italian town (745 inhabitants).

BASICO syntax ideas

Today the programming language C is the standard. Other languages are comments to C. PASCAL is C with nested procedures but without stdio library. C++ is C overkill. Modula-2 is C without printf but with import and export. Ada is PASCAL++ for the DoD. JAVA is C++ with garbage collection but byte-code. Tcl/Tk is no C at all, but a language with minimum syntax. Last but not least BASIC is still not dead, but has included a lot of syntax from other languages.

GOTO considered harmful

Today everybody agrees on the "one coding block shall have one entry and one exit" paradigma " to shorten the conceptual gap between the static program and the dynamic process, to make the correspondence between the program (spread out in text space) and the process (spread out in time) as trivial as possible". But in the same paper Dijkstra also wrote "I remember having read the explicit recommendation to restrict the use of the go to statement to alarm exits".
Just deleting the goto statement and not introduce some other syntax for "alarm exits" does make programs less structured. Some alarm exits are
The traditional structured control flow for the "alarm exit" problem needs additional state variables or special function return-values that remember the alarm-status. Specially for return stack un-rolling the dynamic control flow is hard to see out of the static program listing.

Statement separator

The original BASIC of 1964 used linefeed as statement terminator. PASCAL uses ; as statement separator, C uses ; as statement terminator. Beginners often forget this silent syntax element. BASIC later adopted the : as an additional statement separator. BASICO avoids silent syntax elements, therefore both linefeed and ; are used as statement terminators. With linefeed as statement terminator a long statement can not be split in two source code lines.  Such a statement needs a continuation symbol to undo the statement terminator effect of the linefeed. In Tcl/Tk the \ as last character before linefeed works as continuation symbol. BASICO follows Tcl/Tk syntax.

Statement list

PASCAL encloses statements between the keywords BEGIN and END. C uses { and } for the same purpose. These statement list keywords give trouble to the beginner. First, they are silent syntax elements, second the beginner has problems to see the difference between statement and statement list.
The statement list keywords are not needed if the control flow keywords include them. One example is the WHILE condition DO statement-list WEND syntax. Another is the IF condition THEN statement-list ENDIF construct. The first statement-list is enclosed by DO and WEND, the second statement-list with THEN and ENDIF.

Dangling else

Dangling else is solved with the IF condition THEN statement-list ELSE statement-list ENDIF construct. With ENDIF there is no more dangling else shift/reduce conflict.

Assignment and Equal

BASIC uses = both for assigment and for test on equivalence, as it is mathematical tradition. Pascal uses := and =, C uses = and ==. As long as multiple assignment like a = b = c is not needed, the both meanings of = do not produce a conflict. The BASICO meaning of  a = b = c is a assign (b equal c), that is a becomes 1 if b is equal to c and 0 else. The author's all time favorite in simple C bugs, the nasty if (a = b) instead of if (a == b) is gone with BASICO.

Evaluation of conditional expressions

The symbols '<', '>=', .. are defined as conditional operators. The expression a >= b can result in 1 for true or 0 for false. There is no short circuit evaluation for conditional expressions. This is fine for today's CPUs that do not like jumps very much because of the execution queue flushing.

Variable type declaration

Pascal and C have strict variable types. Without declaration no variable can be used. BASIC needs a variable declaration only for array variables. Scalar variables can be used directly. BASICO follows Pascal and C. The variable types are integer, character, one-dimensional array of integer and one dimensional array of character.

Automatic type conversion

All calculations are done with integer. Therefore the char variables are promoted to integer. The BASICO char is an unsigned char. There are no negative char values.

Function calls

The C library functions like printf() or strcpy() can be called out of a BASICO program. The C call convention for the GNU C compiler is used. The function arguments are pushed from right to left on the stack. All arguments have the same size (32bit for GNU C on Intel x86). The return value in placed into a CPU register (eax for GNU C on Intel x86).

Nested functions

BASICO does not support nested functions. I even think that nested functions are a bad method for information hiding.

Call-by-value and Call-by-reference

Call-by-value is the method for int and char variables as function arguments: A copy of the variable is forwarded to the function. With call-by-value only input parameters are possible. Array variables are forwarded with call-by-reference: The memory location of the start of the array is given. This method allows arrays as input and output parameters.
The return value of the function is one output parameter. To get more output parameters, the trick in BASICO is to use arrays as function arguments to hold these output parameters.

Named constants

Pascal has named constants with the const keyword. C uses the #define preprocessor command for constants. BASICO has no named constants (for now).

Comment

BASIC uses REM to start a one-line comment. PASCAL and C have multi-line comments with { } or /* */. C++ re-introduced the one-line comment with //. BASICO uses the C++ one-line comment.

Compiler error handling

The compiler stops at the first error. The source code line is shown up to the point where the scanner, parser or code-generator detected the bug and an error message is given. For the batch processing age compiler gurus out there: one error is all you get from this compiler.

BASICO syntax

The following LL(1) context-free grammar in Extended Backus-Naur-Form (EBNF) for BASICO is final. It is based on the PASCAL syntax. The syntax is free of shift/reduce conflicts. All variable declarations are before variable use. A simple LL(1) predictive parser needs one pass to translate the source. Syntax terminals are enclosed in ' ' or " " like '<' and "if". Capital letter non-terminals come from the scanner (CONST, IDENT, ..). The empty set is commented. The scanner translates some linefeeds into ';'. This trick makes the syntax semicolon-free.
The EBNF was checked by Coco/R a scanner and parser generator for LL(1) grammars from Hanspeter Mössenböck. In EBNF, [ ] is 0 to 1 repetition, { } is 0 to n repetion, ( ) is grouping alternatives.

basico =
  [ "var" { globalDecl ';' } ]
  { "func" IDENT '(' paramList ')' ':' returnType ';' blockOrForward ';' }
  .

globalDecl =
  IDENT ':' (
    "int"
    | "char"
    | "array" CONST (
      "int"
      | "char"
    )
  ) . 

paramList =
  [ paramDecl { ',' paramDecl } ]
  . 

paramDecl =
  IDENT ':' (
    "int"
    | "char"
    | "array" (
      "int"
      | "char"
    )
  ) .
 
returnType =
  "int"                           
  | "char"                        
  | "void"                        
  .
  
blockOrForward =
  pblock3
  | "forward"
  .
 
pblock3 =
  [ "var" { localDecl ';' } ]
  "begin" stmtList "end"
  .

localDecl =
  IDENT ':' (
    "int"
    | "char"
    | "array" CONST (
      "int"
      | "char"
    )
  ) . 
 
stmtList =
  { [ statement ] ';' }
  .
 
statement =
  IDENT (
    [ '[' expr ']' ] '=' expr
    | exprList 
  )
  | "if" expr "then" stmtList [ "else" stmtList ] "endif"
  | "while" expr "do" stmtList "wend"
  | "return" [ expr ]            
  | "break"
  .

exprList =
  '(' [ expr { ',' expr } ] ')'
  .
 
expr =
  addexpr { '=' addexpr
          | '#' addexpr
          | '<' [ '=' ] addexpr
          | '>' [ '=' ] addexpr
  } .
 
addexpr =
  term {  '+' term
        | '-' term
        | '|' term
        | '^' term
  } .
 
term =
  unaryfact { '*' unaryfact
            | '/' unaryfact
            | '%' unaryfact
            | '&' unaryfact
  } .
 
unaryfact =
  '-' fact          
  | '~' fact        
  | fact
  .
 
fact =
  IDENT [ '[' expr ']' | exprList ]
  | CONST
  | CHRCONST             
  | STRCONST                       
  | '(' expr ')'
  .

BASICO example program

This is the old "Guess a number" game in BASICO. The example program uses the C functions getchar(), printf(), time(),  srand() and rand(). The function main() is the entry point like in C. This is the first BASICO game ever written. Please note the structured use of alarm exits with "return decimal" in getDecimal() and "break" in main().

// guess.bas
// Guess a number game
// Compile with BASICO version 0.9

func getDecimal():int
var decimal: int
  ch: char
begin
  decimal = 0;
  while 1 do
    ch = getchar()
    if (ch>='0')&(ch<='9') then
      decimal = decimal * 10 + ch - '0'
    else
      return decimal
    endif
  wend
end

func main():void
var myNumber: int
  yourNumber: int
  guesses: int
  ch: char
begin
  printf("Guess a number game\n")
  printf("Numbers are between 1 and 50\n") 
  srand(time(0))  // get time in seconds since 1970, init Random Generator
  while 1 do   
    myNumber = rand()     // get a random number  
    myNumber = myNumber % 50 + 1
    guesses = 0
    while 1 do
      printf("Your guess: ")
      yourNumber = getDecimal()
      guesses = guesses + 1
      if yourNumber = myNumber then
        printf("You guessed it in %d guesses!\n", guesses)
        break
      else
        if yourNumber > myNumber then
          printf("Your guess is to high\n")
        else
          printf("Your guess is to low\n")
        endif
      endif
    wend
    printf("Another game (y or n): ")
    ch = getchar();
    if (ch='n')|(ch='N') then
      break
    endif
    getchar()     // eat \n
  wend
  printf("Goodbye.\n")
end

Compile program:
./basico<guess.bas >guess.s
cc -o guess guess.s

Run program:
./guess
Guess a number game
Numbers are between 1 and 50
Your guess: 25
Your guess is to high
Your guess: 13
Your guess is to low
Your guess: 19
Your guess is to high
Your guess: 16
You guessed it in 4 guesses!
Another game (y or n): n
Goodbye.

BASICO development strategy

Compiler bootstrapping is done in several steps. We start with the development environment for one language, in our case C, and we end with the development environment for another language, BASICO. To minimize our effort, we only replace the C compiler with the BASICO compiler but keep the assembler, linker and library of the C tool-chain.

Predictive Parser for Expression

The following parser is an enhanced version of the infix to postfix translator in chapter 2.5 of "COMPILERS Principles, Techniques and Tools" by Aho, Sethi and Ullman. The program can only handle single digit numbers and can't skip white space, but does understand all single character BASICO operators like =, #, <, >, +, -, |, ^, *, /, %, &, ~, ( and ).

var lookahead: int

func main(): void
begin
  lookahead = getchar()
  expr()
  putchar('\n')
end

func expr(): void
begin
  addexpr()
  while 1 do
    if lookahead = '=' then
      match('='); addexpr(); putchar('=')
    else if lookahead = '#' then
      match('#'); addexpr(); putchar('#')
    else if lookahead = '<' then
      match('<'); addexpr(); putchar('<')
    else if lookahead = '>' then
      match('>'); addexpr(); putchar('>')
    else; break; endif; endif; endif; endif
  wend
end

func addexpr(): void
begin
  term()
  while 1 do
    if lookahead = '+' then
      match('+'); term(); putchar('+')
    else if lookahead = '-' then
      match('-'); term(); putchar('-')
    else if lookahead = '|' then
      match('|'); term(); putchar('|')
    else if lookahead = '^' then
      match('^'); term(); putchar('^')
    else; break; endif; endif; endif; endif
  wend
end

func term(): void
begin
  unaryfact();
  while 1 do
    if lookahead = '*' then
      match('*'); unaryfact(); putchar('*')
    else if lookahead = '/' then
      match('/'); unaryfact(); putchar('/')
    else if lookahead = '%' then
      match('%'); unaryfact(); putchar('%')
    else if lookahead = '&' then
      match('&'); unaryfact(); putchar('&')
    else; break; endif; endif; endif; endif
  wend
end

func unaryfact(): void
begin
  if lookahead = '-' then
    match('-'); fact(); putchar('-')
  else if lookahead = '~' then
    match('~'); fact(); putchar('~')
  else; fact(); endif; endif
end

func fact(): void
begin
  if lookahead = '(' then
    match('('); expr(); match(')')
  else if isdigit(lookahead) then
    putchar(lookahead); match(lookahead)
  else; error(); endif; endif
end

func match(t: int): void
begin
  if lookahead = t then
    lookahead = getchar()
  else; error(); endif
end

func error(): void
begin
  printf("syntax error\n")
  exit(1)
end

Compile program:
./basico04<rpn3.bas >rpn3.s
cc -o rpn3 rpn3.s

Run program:
./rpn3
7=1+2*3
7123*+=

The infix expression 7 test for equal with 1 plus 2 multiply by 3 is translated into the postfix expression push 7, 1, 2, 3, then multiply 2 by 3, add this with 1 and test this for equal with 7.

Examples for Compiler bootstrapping

Symbol Table

Niklaus Wirth used in PASCAL-S a two-dimensional symbol table. BASICO provides only one-dimensional arrays. The compiler program itself has to do the abstraction from 1-dim to 2-dim.
The identifiers in the program can be keywords (if, then, else), global variable names, function names, function parameter names or local variable names. The search in the symbol table is keywords first, then local and parameter variables and last global variables and function names. These three namespaces (keywords, local, global) can be implemented with three symbol tables. Or in one symbal table that has three sections. The first approach is implemented.
The variable name is the unique identifier in the symbol table to get and set the attributes of the variable. One attribute is the variable type like int, char, array-of-int and array-of char. Another is the storage type like global storage or local storage. One more attribute for parameter and local variables is the frame-pointer offset (frame-pointer relative memory location). See symtab.c for details.

Code Generator

The recursive decent parser creates op-codes for a stack-machine. Normal microprocessors are register machines that can handle 1 to 3 addresses per op-code. The Intel x86 chips or the Renesas R8C chip can handle 2 addresses. To bridge the gap between stack-machine and 2-address machine two or more registers of the CPU are used as stack-cache, that is TOS (top of stack), NOS (next of stack), third of stack and fourth of stack are in registers. Instead of moving the contents of the CPU registers to perform the push and pop stack operations, the stack-machine cache labels (TOS, NOS, THIRD, FOURTH) are re-mapped. Therefore CPU register %eax can be NOS at one time and TOS at another time. See codegen.c for details. Some examples of code generation for the x86. The BASICO statement follows as comment the assembler code:

    movl    a, %eax
    movl    $1, %ebx
    subl    %ebx, %eax
    movl    %eax, x
#   x=a-1

    movl    a, %eax
    movl    b, %ebx
    subl    %ebx, %eax
    movl    c, %ebx
    movl    d, %ecx
    subl    %ecx, %ebx
    movl    %edx, %esi
    cltd
    idivl   %ebx
    movl    %esi, %edx
    movl    %eax, x
#   x=(a-b)/(c-d)

    movl    a, %eax
    negl    %eax
    movl    b, %ebx
    movl    c, %ecx
    movl    d, %edx
    imul    %edx, %ecx
    subl    %ecx, %ebx
    subl    %ebx, %eax
    movl    %eax, x
#   x=-a-(b-c*d)

    movl    i, %eax

    movl    $1, %ebx
    addl    %ebx, %eax
    movl    n, %ebx
    movl    i, %ecx
    movl    bb(,%ecx,4), %ecx
    imul    %ecx, %ebx
    movl    %ebx, bb(,%eax,4)
#   bb[i+1] = n*bb[i]

    movl    a, %eax
    movl    b, %ebx
    notl    %ebx
    andl    %ebx, %eax
    movl    a, %ebx
    notl    %ebx
    movl    b, %ecx
    andl    %ecx, %ebx
    orl    %ebx, %eax
    movl    %eax, x
#   x=a&~b|~a&b

Peephole optimization

Wiktionary definition: An optimization that works by eliminating redundant instructions from a small area of code.

Some examples of possible peephole optimization for the Basico compiler:

original code
optimized code
comment
    movl    $0, %eax
    movl    %eax, -12(%ebp)
    movl    $0, -12(%ebp) The x86 CPU can move a constant to a memory location.
    movl    $48, %ebx
    cmpl    %ebx, %eax
    cmpl    $48, %eax
The x86 CPU can compare a constant with a register.
    movl    $48, %ebx
    subl    %ebx, %eax

    subl    $48, %eax
The x86 CPU can subtract a constant from a register.
    movl    $0, %eax
    pushl   %eax
    pushl   $0
The x86 CPU can push a constant.
    cmpl    %ebx, %eax
    sete    %al

    movzbl  %al, %eax
    andl    %eax, %eax
    jz      .L8
    cmpl    %ebx, %eax
    jnz     .L8
The compare op code sets the flags.
    cmpl    %ebx, %eax
    setg    %al
    movzbl  %al, %eax
    andl    %eax, %eax
    jz      .L10
    cmpl    %ebx, %eax
    jng    .L10
The compare op code sets the flags.

A peephole optimizer uses pattern matching to identify a code segment.

BASICO sources

The source code of BASICO version 0.8 is discussed in detail. The findings of this code review are implemented in BASICO version 0.9.

BASICO source code reading

Version 0.9

Here are the bugfixes as mentioned in the source code reading.

BASICO version 0.9

Version 0.8

The statements of a function that is only called from one location is copied to this location (inline the function). This reduced the source lines to 975. The  stripped throw-away compiler binary is 16432 bytes long and needs 46ms to process the basico.bas input file, the BASICO version is 18124 bytes long and needs 82ms. The GNU C compiler was using -O1 optimization. The simple idea of using CPU registers as stack machine cache pays of very well.

BASICO version 0.8

Version 0.7

The parser calls now the code generation functions. The BASICO compiler is written in BASICO. Compiler bootstrapping is possible. For the input file basico.bas the BASICO version 0.5 compiler produces the same assembler code as the BASICO 0.7 compiler. The compiler is 1279 source lines long.

BASICO version 0.7

Version 0.6

Now scanner, symbol table and parser are in BASICO. The parser can successful parse itself. The BASICO parser was transcribed from the CoCo/R Parser.cpp output file of the basico06.atg grammar input file.

BASICO version 0.6

Version 0.5

The compiler components scanner and symbol table are now available in BASICO. The BASICO version scanner.bas was created by transcribing scanner.c. The input file scanner.bas produces the same output file with the BASICO version and the C version of the scanner - like it should be. The compiled C version is 5844 bytes long and needs 12ms to process the scanner.bas input file, the BASICO version is 9928 bytes long and needs 22ms.
Version 0.51 combines the infix to postfix translator in BASICO with the scanner and symbol table. This is an intermediate step to the full BASICO parser in BASICO.

BASICO version 0.51

BASICO version 0.5

Version 0.4

The scanner is now written in C. For easy re-writing of the scanner in BASICO only a subset of C was used. The BASICO source is copied as comment into the output assembler listing.
Version 0.41 has the infix to postfix translator examples.

BASICO version 0.41

BASICO version 0.4

Version 0.3

This version uses symbol tables. Now different variable types and different storage types are working. Even call-by-reference for arrays as function parameters is implemented. The first version of our throw-away compiler is ready. Now work moves from back-end (code generation) to front-end (scanner, parser).
In version 0.31 the lex scanner translates the keywords into characters. The bison parser uses these characters as tokens. These changes prepare the parser for the new scanner. Another detail is the new handling of <= and >=. The boolean operator ^ for xor is new. The not operator is now ~ as in C.

BASICO version 0.31

BASICO version 0.3

Version 0.2

This version can compile expressions with global integer and global array-of-integer variables. Function call with return value is possible. Function call to C library functions with 3 parameters maximum is working, too.  The central missing element is the symbol table. The symbol table tells the type of the variable like char variable and/or local variable. Op-code generation for char and local variables is missing, too. But we have reached some level of "Tiny BASIC Compiler".
Version 0.21 allows comments after a statement.

BASICO version 0.21

BASICO version 0.2

Version 0.1

This version uses lex and yacc to implement the parser of the throw-away compiler. The parser does not only check that the syntax is okay, but does emit some semantic information. Some very first ideas on symbol table and code generation are included too.

BASICO version 0.1

Books

Compiler bauen mit UNIX
Introduction to compiler construction with UNIX
Axel-Tobias Schreiner with H. G. Friedman, Jr.,
german 1985, Hanser, München, ISBN 3-446-14359-9;
englisch 1985, Prentice-Hall Software Series, New York, ISBN 0-13-474396-2;
The book is no longer in print. But you can download the example programs here.

The UNIX Programming Environment
Der UNIX - Werkzeugkasten. Programmieren mit UNIX
Brian W. Kernighan, Rob Pike
german
2002, Hanser, München, ISBN 978-3446142732;
english 1984, Prentice Hall, Inc., ISBN 0-13-937681-X
The english version of the book is still in print. The example programs are here.