BASICO compiler source code reading - Scanner

Author: Andre Adrian
Version: 29.jun.2006


Reading source code is an important skill for the professional programmer. As programmer you enhance and debug source code of other programmers.
The BASICO compiler source code is less then 1000 lines. This is a small program. But even a small compiler is doing a non-trivial task: translate sentences from one language to another language.
The source code reading follows the program flow. As we (this is you the reader and me the writer) progress through the source code I try to explain the background as necessary. There is not a big heap of theory in front of the actual source code reading. Let's read a compiler!


Jack W. Crenshaw wrote the "LET'S BUILD A COMPILER!" installments. His practical look at predictive parser and detailed explanations about code generation for a real CPU were a big inspiration for BASICO.
Hanspeter Mössenböck created the CoCo/R compiler compiler. This tool invites to play with the LL(1) grammar for a hand-written parser. The BASICO grammar evolved by looking at the C++ source code that CoCo/R created for a given BASICO grammar. Then the grammar was tuned to produce simple source code. Last the C++ source code was transcribed to BASICO source code.
Many, many more people have contributed to the compiler topic.

Program execution

The BASICO compiler was written for the UNIX (Linux) operation system. The compiler uses standard in (stdin) to read the source code file and uses standard out (stdout) to write the assembler file. The assembler file contains as comments the source code lines to allow an easy understanding of the translation. Error messages are written to stdout, too.
After the compiler run, the assembler file has to be translated to machine code. The assembler is called through the cc program.
To compile the BASICO source code file basico.bas with the BASICO throw-away compiler basico05 you enter at the shell prompt:

./basico05<basico.bas >basico.s
cc -o basico basico.s

func main()

  954    func main(): int
 955    begin 
 956      inputi = 0  // init scanner
 957      GetChar()
 958      // init symbol table
 959      strcpy(keywords, "array begin break char do end endif forward func if int else return then var void wend while ")
 960      strcpy(tokens, "abBCdeEfFiIlrtvVwW")
 961      globals[0] = 0
 962      locals[0] = 0   // init locals
 963      cacheinit()     // file prologue
 964      token[127] = 0
 965      label = 0
 966      level = -1
 967      whilelevel = -1
 968      lco = 0
 969      printf("\t.file  \"a.bas\"\n")
 970      printf("\t.text\n")
 971      lookahead = yylex()
 972      basico()
 973      printf("\t.section .note.GNU-stack,\"\",@progbits\n") // file epilogue
 974      printf("\t.ident \"Basico 0.8\"\n")
 975    end

Function main() is the function that gets called from the operating system after you start a BASICO program. The line numbers you see are not part of the actual source code. They just make it easier to explain.
Line 954 describes what kind of function main() is. It is a function with no arguments that returns an integer value (int).
Line 955 and line 975 enclose the statements of main(). They give shape to the function and tell the compiler what is part of main().
The first statement in main() is line 956. A global variable inputi is set to 0. The declation of inputi and inputl are:

    13      inputl: array 128 char          // Input line
    14      inputi: int                     // Input line index

The variable inputi is the index for the charater array inputl. The compiler stores the actual source code line into inputl with the help of inputi - details come later. To set the index to 0 at the beginning makes a clean start: the logging of the actual source code line will start at location 0 of array inputl.
Line 957 calls the GetChar() function. This function is explained in the next section.
Line 959 uses the C library function strcpy() to copy the string literal "array ... while " into the variable keywords. This is how we teach the compiler the reserved words. Later the keywords character array will be used to detect if a name in the source code file is a reserved word.

    21      keywords:     array 100 char    // the reserved words

Line 960 uses strcpy() to copy the one-character-version of the keywords into the variable tokens. The reason for these abbreviated version is to make the compiler faster. The CPU can handle characters faster then strings. As you can think, the correlation between keywords and tokens is:

array begin break char do end endif forward func if int else return then var void wend while

Line 961 sets location 0 of character array globals to 0. To understand the reason we should look at how the C library functions detect the end-of-string condition. The character with the value 0 (the NUL character) is used as string delimiter. Setting the first character in a char array to 0 produces a valid NUL-terminated empty string.

    23      globals:      array 1000 char

Later on we will see that variable globals stores the names of the global variables of the program that is compiled. Talking about a compiler is sometimes a little confusing: We have a program that is the compiler that translates another program. If the BASICO compiler compiles the BASICO compiler source code we do compiler bootstrapping. The important point is to see the active part and the passive part in this. The active part is the compiler, the passive part is the compiled program. Line 962 initializes character array locals to a valid NUL-terminated empty string.

    25      locals:       array 1000 char

Line 963 calls function cacheinit(). Line 964 sets the last character in character array token to 0 as a final NUL character. This is an example for defensive programming. We try to make sure that token has a string delimiter whatever else happens.

    16      token:  array 128 char          // identifier or string

Line 965 initializes (inits) integer variable label to 0. In the assembler file we need assembler labels to define the jump targets. Each label needs a separate number. The variable label holds the up-to-now last label number.

    32      label:      int             // Label-id generator

Line 966 inits level to -1. This variable counts the if-then-else-endif nesting level.

    35      level:      int             // for if

Line 967 inits whilelevel to -1. This variable counts the while-wend nesting level.

    36      whilelevel: int             // for while

Line 968 inits lco to 0. The string literals in the assembler file are referenced with string labels. Again we need a counter to keep track of the up-to-now last string label number.

    37      lco:        int             // String const label

Lines 969 and 970 produce the first output to the assembler file. The both printf() C library functions write the following text to the assembler file:

    .file  "a.bas"

The pseudo opcode .file remembers the source code file name in the assembler file. Because the BASICO compiler works with standard in and standard out it does not know the real file name, therefore the fake file name a.bas is used for the source code file name. The .text pseudo opcode tells the assembler that the following lines in the assembler listing should go to the text segment. This is the location for the machine code. Line 971 calls the function yylex(). The return value of yylex() is stored in the variable lookahead. Line 972 calls basico(). Lines 973 and 974 produce the last output to the assembler file:

    .section .note.GNU-stack,"",@progbits
    .ident "Basico 0.8"

The purpose of these lines is not totally clear to me. I just reverse engineered it out of the GNU C compiler.

func GetChar()

  62    func GetChar(): void
  63    begin
  64      if Look = '\n' then
  65        inputl[inputi] = 0
  66        printf("# %s", inputl)
  67        inputi = 0
  68      endif 
  69      Look = getchar()
  70      if Look >= 0 then
  71        inputl[inputi] = Look
  72        inputi = inputi + 1
  73      endif
  74    end

As the name suggests, GetChar() gets one character from the source code file. This character is stored in the variable Look:

    12    var Look: int                     // Lookahead Character

Having a lookahead character is the easiest solution to the problem of how to detect the end of a regular expression like a variable name that starts with a letter and ends with a character that is not a letter nor a digit. This end-the-regular-expression character must be read from standard in to detect the end, but what to do with this character? You can't throw it away, because it can be the start of the next syntactic item. So you store it in the variable Look.
The if command in lines 64 to 68 controls the output of the actual source code line as comment to the assembler file. If Look is a line-feed character, the string in inputl gets NUL terminated in line 65. Output of the contents of inputl in done with printf() in line 66. Line 67 rewinds the index variable inputi to make it ready for the next source code line.
Line 69 uses the C library function getchar() to read one character from the source code file into variable Look. At end-of-file getchar() returns -1.
The if command in lines 70 to 73 check if Look contains a normal character that is greater or equal then 0. In line 71 Look is stored in inputl for later output. In line 72 the index inputi is increased.

func cacheinit()

  345    func cacheinit(): void
 346    begin
 347      se = 0
 348      strcpy(reg, "dcba")
 349    end

The BASICO compiler uses CPU registers to cache the 4 topmost values of the data stack. The function cacheinit() inits the mapping of  CPU registers to stack positions. The variable se counts the number of values on the stack.
The string literal "dcba" inits the stack cache. The registers are identified by a single character. After line 348 the variable reg contains:

reg[1] reg[2] reg[3]

func yylex()

For better reading some parts of function yylex() are shown later.

  91    func yylex(): int
  92    var i: int
  93      ch: char
  94    begin
  95      yytext[0] = 0
  97      // linefeed to ; translation is part of skip white space
  99      // ignore white space
 108      // ignore comment
 127      // one character tokens
 135      // char const
 173      // array char const
 200      // int const
 219      // keyword or ident
 241      if Look < 0 then
 242        return Look     // EOF
 243      endif
 245      error("Unknown character")
 246    end

The longest function in the BASICO compiler is yylex(). Most of the scanner (lexer) is realized with yylex(). A scanner reads in the source code on a character by character basis. Using regular expressions (RE) the scanner forms tokens: multi character strings that are the building blocks (atoms) of a program.
Function yylex() defines two local variables in lines 92 and 93. Variable i is an integer and ch is a character. Line 95 contains some defensive programming: The global variable yytext returns sometimes information from yylex(). To have a clean start, the first character of yytext is set to 0 to form an empty NUL terminated string. Lines 241 to 243 are part of an return stack unrolling alarm exit. If function GetChar(), which is called from yylex(), detects end-of-file, the variable Look gets a value less then 0. This end-of-file information has to go to the caller of yylex(), the function basico(). Lines 241 to 243 forward the alarm exit from GetChar() to basico(). A better solution would be try/throw/catch. But this kind of alarm exit is not implemented in BASICO.
Line 245 is another alarm exit. This time the compiler gets terminated with the message "Unknown character".

One character token

 128      if in(Look, "-+*/%&|^~,:;()=#<>[]") then
 129        nows = 1
 130        ch = Look
 131        GetChar()
 132        return ch
 133      endif

Some tokens are one character tokens. In this case the ASCII representation of the token is used directly as the one character representation of the token in the parser. Lines 128 to 133 scan the one character tokens. The if in line 128 uses the function in() to compare the character in variable Look with a set of characters in the string literal "-+*/%&|^~,:;()=#<>[]". If the contents of Look is in the set, the No-Whitespace flag nows is set to 1 in line 129. The actual contents of Look is stored in variable ch in line 130. To have a new lookahead character the next character is read from the source file with GetChar() in line 131. Line 132 exits the function yylex() with the contents of variable ch as return value.

Integer literal

 201      if isdigit(Look) then
 202        nows = 1; i = 0
 203        yytext[i] = Look
 204        i = i + 1
 205        GetChar()
 206        while isdigit(Look) do
 207          yytext[i] = Look
 208          i = i + 1
 209          GetChar()
 210          if i > 120 then
 211            error("number too long")
 212          endif
 213        wend
 214        yytext[i] = 0
 215        number = atoi(yytext)
 216        return 'n'
 217      endif

CONST is the easiest multi character token. CONST is an integer literal, a number without decimal point nor sign. The RE for CONST is:

  digit = "0123456789".
  CONST  = digit {digit}.

A digit is one of the ASCII characters 0, 1, 2, .. 9. A CONST starts with a digit. After the first digit there can be 0 or more digits. The meta symbol { } reads "0 to N repetition".
Lines 201 to 217 realize the CONST scanner. The if in line 201 uses the isdigit() C library function to trigger on a digit. In line 202 the No-Whitespace flag nows is set to 1 and we rewind the index i to 0. Lines 203 and 204 store the actual Look value into yytext and increment the index.
Line 205 gets the next character from source file through GetChar(). The while command in lines 206 to 213 copies the following digits into yytext. The if command in lines 210 to 212 is again defensive programming: We avoid to let i become larger then the length of yytext. Without this check it is possible to overwrite data in the memory locations after the memory location of yytext. Such an overwrite can be the start of a "buffer overflow" attack.  So, watch your buffers!
Lines 214 to 216 finish the job of CONST recognition. The while statement in line 206 jumps to line 214 if the condition isdigit() is no longer true. In line 214 the string in yytext gets NUL terminated. Line 215 converts the string in yytext into a integer that is stored in the variable number. We use the C library function atoi() for this. Line 216 exits the function yylex() with a return value of 'n'. This 'n' tells the parser that yylex() found a number in the source file. The variable number is used as a scanner attribute to hold the integer literal value.


  220      if isalpha(Look) then
 221        nows = 1; i = 0
 222        yytext[i] = Look
 223        i = i + 1
 224        GetChar()
 225        while isalnum(Look) do
 226          yytext[i] = Look
 227          i = i + 1
 228          GetChar()
 229          if i > 120 then
 230            error("ident too long")
 231          endif
 232        wend
 233        yytext[i] = 0
 234        ch = gettoken(yytext)
 235        if ch = 'x' then
 236          strcpy(token, yytext)
 237        endif     
 238        return ch
 239      endif

The next RE is IDENT. IDENT matches a variable name (identifier). To separate a variable name from an integer literal, a variable is not allowed to start with a digit. This leads to the variable name RE that is common to nearly all programming languages:

  letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".
  digit = "0123456789".
  IDENT  = letter {letter | digit}.

An IDENT starts with a letter. After this letter there can be 0 or more letters or digits. Valid examples of IDENT are: a, i, Tom, func99, myLittleFunction.
The meta-symbol | is alternative. Lines 220 to 239 scan an IDENT. The general layout is like CONST. Line 220 has an if statement with C library function isalpha() to detect a letter as first character of the IDENT. Lines 221 to 232 scan the following characters of IDENT. The C library function isalnum() returns a nonzero value for an argument that is letter or digit - an alphanumerical character. Line 233 makes the contents of yytext a NUL terminated string. Line 234 uses the function gettoken() to check if the IDENT was a reserved word. This is a little trick: a reserved word like "begin" has the same RE as an IDENT. So we first detect the more common pattern of IDENT and then check if the IDENT is a reserved word. The function gettoken() returns 'x' if the argument was not a reserved word. The if command in lines 235 to 237 uses this behaviour to copy yytext only to token if yytext contains an identifier. Line 238 returns the contents of ch. The variable can contain the one-character-versions of the reserved words or 'x' . The variable token is used as a scanner attribute to hold the  identifier name.

Character literal

  136      if Look = '\'' then
 137        nows = 1
 138        GetChar()
 139        if Look = '\\' then
 140          GetChar()
 141          if Look = 'n' then
 142            yytext[0] = 10
 143          else if Look = 't' then
 144            yytext[0] = 9
 145          else if Look = '\'' then
 146            yytext[0] = 39
 147          else if Look = '\\' then
 148            yytext[0] = 92
 149          else
 150            error("unknown \\ char")
 151          endif; endif; endif; endif
 152          GetChar()
 153          if Look = '\'' then
 154            GetChar()
 155            token[0] = yytext[0]
 156            return 'c'
 157          else
 158            error("missing \'")
 159          endif       
 160        else
 161          yytext[0] = Look
 162          GetChar()
 163          if Look = '\'' then
 164            GetChar()
 165            token[0] = yytext[0]
 166            return 'c'
 167          else
 168            error("missing \'")
 169          endif       
 170        endif
 171      endif  

A character literal contains one ASCII character enclosed in ' '. The printable characters like ' ' for space and 'a' for lower case a are simple. Two kinds of characters are special: non-printable characters like linefeed and tabulator and the representation of the ASCII character ' itself. Both special cases are handled with an escape character, the \ character. The character after the escape character is used as literal. One problem solved, next problem pops up: What to do if we want to use the escape character itself as literal? Well, this is not a real problem, because any character after the escape character is used in literal fashion, including the escape character. Together we get the familiar \' for escape character \ and literal ' and \\ for escape character \ and literal \. The non-printable characters use the C language convention: \n is linefeed and \t is tabulator.
Lines 136 to 171 scan for character literal. The if statement in line 136 looks for the starting '. In line 137 the No-Whitespace flag nows is set to 1. Line 138 reads one character from the source code file. Lines 139 to 159 handle the special case of escape character. The if statement in 139 looks for \. Line 140 reads one more character with GetChar(). The lines 141 to 151 look more complicated as they are. The idea is to select among more then two possibilities. Because BASICO has only if-then-else, we have to cascade many if-then-else to get a N-way selection. With traditional nesting the lines 141 to 151 look like:

  if Look = 'n' then
    yytext[0] = 10
    if Look = 't' then
      yytext[0] = 9
      if Look = '\'' then
        yytext[0] = 39
        if Look = '\\' then
          yytext[0] = 92
          error("unknown \\ char")

The readability in traditional writing is worse then in the "else if" writing style. In both representations the lines 141 to 151 do the same: Compare the contents of variable Look with the characters 'n', 't', '\'' and '\\'. If the compare matches, the values 10, 9, 39 or 92 are stored in yytext[0], the first location in the character array. If no compare matches, the function error() is called in line 150.
Lines 152 to 159 handle the closing ' of character literal. If found the function yylex() returns 'c' for character literal. The token attribute is in token[0]. Any other character then ' terminats the compiler through the error() function.
Line 161 stores for the normal case the character value in yytext[0]. The lines 162 to 169 are identical to 152 to 159 and fulfill the same purpose - with refactoring these lines can be eliminated. See basico.bas version 0.9.

String literal

  174      if Look = '"' then
 175        nows = 1; i = 0
 176        yytext[i] = Look
 177        i = i + 1
 178        GetChar()
 179        while Look # '"' do
 180          if Look = '\\' then
 181            yytext[i] = Look
 182            i = i + 1
 183            GetChar()
 184          endif
 185          yytext[i] = Look
 186          i = i + 1
 187          GetChar()
 188          if i > 120 then
 189            error("string to long")
 190          endif
 191        wend
 192        yytext[i] = Look
 193        i = i + 1
 194        GetChar()
 195        yytext[i] = 0
 196        strcpy(token, yytext)
 197        return 's'
 198      endif

A string literal is just a sequence of ASCII characters that are enclosed between " and ". With this definition it is not possible to have the ASCII character " in a string literal. Again an escape character is used. We get the familiar \" for escape character \ and literal " and \\ for escape character \ and literal \. The string literal scan is very similar to the character literal scan. One difference is the " delimiter instead of '. Another difference is the while command in lines 179 to 191. This while loop scans over all characters in the string literal in the search for the closing ". A " behind a \ is not recognized as closing " because every time an escape character is detected in line 180, two characters are scanned.  The escape character in lines 181 to 183 and the second character in lines 185 to 187. The function yylex() returns 's' for a string literal. The token attribute is in variable token.

yylex() return values

The yylex() scanner returns the following information.

one character token
integer literal
character literal
string literal
function returns
one character token
keyword token

Skip white space

 100      while (Look = ' ') | (Look = '\t') | (Look = '\n') do
 101        if (Look = '\n') & nows then
 102          nows = 0
 103          return ';'
 104        endif
 105        GetChar()
 106      wend

White space are the ASCII characters space and tabulator. The linefeed character can be white space or not, depending on the value of the No-Whitespace flag nows. Here is the trick: The BASICO grammar uses ';' as line terminator, but you have to type ';' only if you want to put more then one BASICO statement on one source code line. The scanner converts linefeed into white space or into ';' depending on variable nows. As you have seen above nows is set to 1 every time a valid token was found. The if command in lines 101 to 104 implement the wanted behaviour If there is a linefeed and if nows was set, the function yylex() returns ';', else it reads more white space.

Skip comment

 109      while Look = '/' do
 110        GetChar()
 111        if Look # '/' then
 112          return '/'
 113        endif   
 114        while Look # '\n' do
 115          GetChar()  
 116        wend
 117        if nows then
 118          nows = 0
 119          return ';'
 120        endif
 121        GetChar()
 122        while (Look = ' ') | (Look = '\t') | (Look = '\n') do
 123          GetChar()
 124        wend
 125      wend

The BASICO comment starts with // and ends with linefeed. One / alone is the one-character token for division. Line 109 checks for the first /. Line 111 checks for the second /. If no second / is found, the function yylex() returns '/'. Lines 114 to 116 read up to linefeed. Lines 117 to 120 check if a linefeed has to be translated into ';'. Lines 121 to 124 read over white space.

func in()

  77    func in(ch: char, s: array char): int
  78    var i: int
  79    begin
  80      i = 0
  81      while s[i] # 0 do
  82        if ch = s[i] then
  83          return 1
  84        endif
  85        i = i + 1
  86      wend
  87      return 0
  88    end

The function in() interprets a string as a set of characters and performs the in operator: Is character ch in set s? The answer can be 1 for yes or 0 for no.
The variable declaration of the string s has no string length. In the programming language PASCAL this would be an error. But in C this is fine. Because PASCAL knows the string length, the PASCAL program can automatically check for buffer overwriting. But on the other hand this feature makes it hard to use strings with different length. Today we live with C and buffer overwriting if the programmer has forgotten to write the necessary defensive code.
In C++ and JAVA you can use "intelligent" strings that avoid the old C and PASCAL string problems. But the operating systems of today and many libraries still use the old string concept. Example: The filename for the file open operating system function is an "intelligent" string in your C++ or JAVA program, but you need an old fashioned string for the system call. Even worse is the other way around:  using OS function readdir() to read the filenames or directory names. The OS gives you only old fashioned strings that you have to convert to "intelligent" strings. All in all: write everything yourself on the bare bone hardware, then you can make real use of automatic buffer overwrite checks. If you do not have this luxury learn to write defensive code!

func error()

  53    func error(msg: array char): void
  54    begin
  55      inputl[inputi] = 0
  56      printf("\n# %s\n", inputl)
  57      printf("# error %s\n", msg)
  58      exit(1)
  59    end

The error() function writes two last messages to the assembler file. Line 56 writes the actual source code line as comment. Line 57 writes the error message as comment, too. The C library function exit() aborts the program. The argument 1 tells the operating system that the program did finish with error.

func gettoken()

 282    func gettoken(name: array char): char
 283    var id: int
 284    begin 
 285      id = lookup(keywords, name)
 286      if id >= 0 then
 287        return tokens[id]
 288      else
 289        return 'x'
 290      endif
 291    end

The function gettoken() is a typical "enrichment" function. To do the real work it calls the function lookup(). The return of lookup() is modified and enriched by gettoken(). The purpose of gettoken() is to detect if the contents of variable name is a reserved word like "endif". The function lookup() can search a name in a list of names, but if the name is not found, lookup() returns the value -1. Function gettoken() modifies the return to make it 'x' instead of -1 in line 289.
If the name was found, the return of lookup() has to be converted to the one-character-representation of the reserved words, like 'E' for endif. This index to value translation is done in line 287. The character array tokens contains the translation table.

func lookup()

 252    func lookup(symtab: array char, name: array char): int
 253    var cn: char; cs: char
 254      in: int; is: int
 255      found: int; sym: int
 256    begin
 257      in = 0; cn = name[in]
 258      is = 0; cs = symtab[is]
 259      found = 1; sym = 0
 260      while cs # 0 do    // table end
 261        if cn # cs then
 262          found = 0
 263        endif
 264        is = is + 1; cs = symtab[is]
 265        if cn # 0 then
 266          in = in + 1; cn = name[in]
 267        endif
 268        if cs = ' ' then                    // symbol end
 269          if (found = 1) & (cn = 0) then
 270              return sym
 271          else
 272            in = 0;  cn = name[in]
 273            is = is + 1; cs = symtab[is]
 274            sym = sym + 1
 275            found = 1
 276          endif
 277        endif
 278      wend
 279      return -1
 280    end

Function lookup() is the string matcher of the scanner. The scanner uses the global character arrays globals, locals and keywords to store all identifiers and reserved words. Function lookup() is called with one of these symbol tables as argument symtab and the name that is to be matched as argument name. It returns the index position in the symbol table with 0=first position, 1=second position or -1 for name not found.
Lines 257 and 258 rewinds the indices and loads the actual characters to be compared in cn and cs. The local variable cn contains the characters of the name, the variable cs contains the characters of the symbol table. Line 259 inits the is-equal flag found to 1 and sets the index position in the symbol table sym to 0.
Lines 260 to 278 are a while command over the symbol table. The while loop ends when the NUL character in the symbol table is found.
Lines 261 to 263 do a character by character compare. If the characters are different the found flag is set to 0 to show mismatch.
Lines 264 to 267 increase the indices and load the characters for the name and symtab arrays. The if statement in line 265 avoids to read behind the NUL character of the variable name.
The separator charater between two entries in the symbol table is ' '. The if command in line 268 to 277 is triggered at the end of one entry of the symbol table. In line 269 success or failure is detected. The string match was successful if the character success flag found is still 1 and the complete name was compared. Then the function returns with the value of sym in line 270.
Lines 272 to 275 prepare the next comparsion and are similiar to lines 257 to 259.

Next chapter BASICO source code reading - Parser 1
Back to BASICO title page