PCCTS Version 1.0x Advanced Tutorial Terence Parr, Hank Dietz, Will Cohen School of Electrical Engineering Purdue University West Lafayette, IN 47907 Spring 1992 parrt@ecn.purdue.edu hankd@ecn.purdue.edu cohenw@ecn.purdue.edu Constructing a translator can be viewed as an iterative refine- ment process moving from language recognition to intermediate-form transformation. This document presents one possible sequence of refinements. We shall use as many features of PCCTS as is reasonable without regards to optimality. We will develop a compiler for a simple string manipulation language called sc. The compiler will generate code for a simple stack machine. This document assumes familiarity with the concept of a parser generator and other language recognition tools - in particular, a passing familiarity with YACC is useful. The development will proceed as follows: (i) Define a grammar to recognize functions, statements, variable definitions and expressions in sc. (ii) Add symbol table management to the grammar developed in (i) (iii)Add actions to translate sc into stack code for a simple stack machine. (iv) Construct trees as an intermediate-form. (v) Build a code-generator to translate the trees into executable C code. Page 1 PCCTS Advanced Tutorial 1.0x 1. sc Language definition sc is a simple, C-like language with only one data object type- string. It has global variables, functions, arguments and local vari- ables. There are no arrays or structures. 1.1. sc Expressions Expression atoms are limited to variables, string constants ("character(s)") and function calls (function(expr)). The string con- stants "false" or "0" represent false and true or "1" represent true. A limited set of operators are available (from highest to lowest pre- cedence) - Unary operator negate. Numeric operator. *, / Binary operators multiply and divide. Groups left-to-right. Numeric operator. +, - Binary operators add and subtract. Groups left-to-right. Numeric operator except for + which is "concatenate" if one of operands is non-numeric. ==, != Binary operators equal, not equal. Groups left-to-right. Numeric or non-numeric. = Binary assignment operator. Groups right-to-left. For exam- ple, a = b = c means to assign c to b and then to a. 1.2. sc Functions and statements Functions may have multiple local variables, but at most one parameter. All functions implicitly return a string to the calling function, although the value need not be used. If a return-statement is not executed within a function, the return value for that function is undefined. Functions are defined exactly as in C: function(arg) { statement(s) } A function called main() must exist so that our interpreter and compiled code knows where to begin execution. Also, main() always has a implicitly passed parameter which contains any command-line argu- ment. sc has six statements. if Conditionals have the form: Page 2 PCCTS Advanced Tutorial 1.0x if ( expr ) statement whileLoops have the form: while ( expr ) statement { statement(s) } Block statements treat more than one statement as a single state- ment as in C. However, one cannot define variables local to a block statement. expr;An expr can include any valid sc expression, but only expressions involving assignments and function calls are useful. return expr ; This statement behaves exactly like the C return statement except that only strings can be returned. print expr; Print the string described by expr to stdout. 1.3. Example The following code is a simple sc program that computes n fac- torial. main(n) { print fact(n); print "\n"; } fact(n) { if ( n == "0" ) return "1"; return n * fact(n - "1"); } 2. Language recognition We begin our project by constructing a parser-a program that will recognize phrases in sc. The first two subsections describe the sc lexicon while the later subsections present the sc grammar with and without symbol table management. Page 3 PCCTS Advanced Tutorial 1.0x 2.1. Lexical analysis Our language has only strings of upper- and lower-case letters, called WORD's, operators, string constants and a few grammatical grouping symbols; all of which except WORD will be defined implicitly by mentioning them in the grammar. WORD will be placed last in the grammar-i.e. after all keywords have been defined. Since keywords are really just strings of characters as well, they must be treated spe- cially. When two (or more) regular expressions can be matched for the current input text, DLG scanners resolve the ambiguity by matching the input to the expression mentioned first in the description. Hence, we place #token WORD "[a-zA-Z]+" at the bottom of the file. Tabs, blanks and carriage-returns are all considered white space and are to be ignored (although we wish to track line numbers). The following ANTLR code will instruct the parser to skip over white space. #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> Strings are strings of characters enclosed in double-quotes. For simplicity we will allow any character not in the ASCII range 0..0x1F (hex) plus any "escaped" version of the same character set. #token STRING "\"(~[\0-\0x1f\"\\]|(\\~[\0-\0x1f]))*\"" <<;>> 2.2. Attributes The DLG scanner matches keywords, WORD's, etc... on the input stream. The way ANTLR parsers access that text is though the attri- bute mechanism. Attributes are run-time objects associated will all grammar elements (items to the right of the : in a rule definition). Although attributes are associated with subrules and rule references, only those attributes linked to grammar tokens are of interest to us. Attributes are denoted $i where i is the grammar element i positions from the start of the alternative. The user must define an attribute type called Attrib. This type is most often some sort of string whether constant or dynamically allocated. For simplicity, we employ a standard attribute type available with the PCCTS system. Attributes are structures that contain a constant width (30 character) string. Page 4 PCCTS Advanced Tutorial 1.0x By placing the character array inside of a structure, the entire string can be copied like any simple C variable. The ANTLR pseudo-op #header is used to describe attributes and other information that must be present in all generated C files. In our case, we simply need to include the standard text attribute definition. This pseudo-op must occur first in the grammar file if it exists at all. #header <<#include "charbuf.h">> 2.3. Grammatical analysis This subsection describes the grammatical or syntactic aspects of sc. No symbol table management is used and therefore functions and variables are considered simple WORD's. Later versions of the grammar can use the tokens VAR and FUNC. 2.3.1. Definitions, variables An sc program is a sequence of definitions-variables or func- tions: p : ( func | "var" def ";" )* ; The ( )* subrule means that there are zero or more definitions. The | operator starts a new alternative. The grammar for a variable definitions is broken up between the rule p and def so that def can be reused for parameter definitions (it also makes more sense when code generation has been added). def : WORD ; 2.3.2. Functions sc functions follow C's format except that the default return type is a string instead of int. func : WORD "\(" { WORD } "\)" "\{" ( def )* ( statement )* "\}" ; Page 5 PCCTS Advanced Tutorial 1.0x Note that the parentheses and the curly-braces must be escaped with \ because they are special regular expression symbols. 2.3.3. Statements The statements outlined in the sc language definition can be described with statement : expr ";" | "\{" ( statement )* "\}" | "if" "\(" expr "\)" statement {"else" statement} | "while" "\(" expr "\)" statement | "return" expr ";" | "print" expr ";" ; where { } means optional. 2.3.4. Expressions The operators and expression atoms described in the language definition can be recognized by the following grammar fragment. expr : WORD "=" expr | expr0 ; expr0 : expr1 ( ("=="|"!=") expr1 )* ; expr1 : expr2 ( ("\+"|"\-") expr2 )* ; expr2 : expr3 ( ("\*"|"/") expr3 )* ; expr3 : {"\-"} expr4 ; expr4 : STRING | WORD { "\(" { expr } "\)" } | "\(" expr "\)" ; Rule expr is ambiguous if we only look one token into the future since WORD can also be an expr0. However, if we were to tell ANTLR to use two tokens, it could see that the "=" assignment operator uniquely Page 6 PCCTS Advanced Tutorial 1.0x identifies which alternative to match when WORD is the first token of lookahead. Rule expr makes this grammar LL(2); but, we only use the extra token of lookahead in expr (leaving all others LL(1)). Please note the use of ANTLR's command-line option "-k 2" in the makefiles presented below. ANTLR does not have a method of explicitly outlining operator precedence. Instead precedence is implicitly defined by the rule invocation sequence or abstractly by the parse-tree. Rule expr calls expr0 which calls expr1 etc... nesting more and more deeply. The pre- cedence rule-of-thumb in ANTLR (and any LL-type parser) is: the deeper the nesting level, the higher the precedence (the more tightly the operator binds). Operators in the expression starting rule have the lowest precedence whereas operators in the last rule in the expression recursion have the highest precedence. This becomes obvious when a parse-tree for some input text is examined. Once again, note that the operators for sc must be escaped as they are reserved regular expression operators as well. Page 7 PCCTS Advanced Tutorial 1.0x 2.4. Complete ANTLR description (w/o symbol table management) The following code is the complete ANTLR description to recognize sc programs. Nothing is added to the symbol table and undefined variables/functions are not flagged. Page 8 PCCTS Advanced Tutorial 1.0x #header <<#include "charbuf.h">> #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> #token STRING "\"(~[\0-\0x1f\"\\]|(\\~[\0-\0x1f]))*\"" <<;>> << main() { ANTLR(p(), stdin); } >> p : ( func | "var" def ";" )* "@" ; def : WORD ; func : WORD "\(" { def } "\)" "\{" ( "var" def ";" )* ( statement )* "\}" ; statement : expr ";" | "\{" ( statement )* "\}" | "if" "\(" expr "\)" statement {"else" statement} | "while" "\(" expr "\)" statement | "return" expr ";" | "print" expr ";" ; expr : WORD "=" expr | expr0 ; expr0 : expr1 ( ("==" | "!=") expr1 )* ; expr1 : expr2 ( ("\+" | "\-") expr2 )* ; expr2 : expr3 ( ( "\*" | "/" ) expr3 )* ; expr3 : {"\-" } expr4 ; expr4 : STRING | WORD { "\(" { expr } "\)" } | "\(" expr "\)" ; #token WORD "[a-zA-Z]+" Page 9 PCCTS Advanced Tutorial 1.0x 2.5. Makefile The following makefile can be used to make the above language description into an executable file. We assume that the ANTLR includes and standard attribute packages are located in a directory accessible to us as tut1.g. # # Makefile for 1.00 tutorial (no symbol table stuff) # ANTLR creates parser.dlg, err.c, tut1.c, tokens.h # DLG creates scan.c, mode.h # CFLAGS= -I../h GRM=tut1 SRC=scan.c $(GRM).c err.c OBJ=scan.o $(GRM).o err.o tutorial: $(OBJ) $(SRC) cc -o $(GRM) $(OBJ) # build a parser and lexical description from a language description $(GRM).c parser.dlg : $(GRM).g antlr -k 2 $(GRM).g # build the scanner from a lexical description scan.c : parser.dlg dlg -C2 parser.dlg scan.c Remember that make wants a tab, not spaces, in front of the action statements (e.g. cc, antlr, ...). 2.6. Testing After successful completion of make, the executable tut will exist in the current directory. tut takes input from stdin. There- fore, one parses a file via tut < file.sc The prompt will return without a message if no syntax errors were discovered. However, if file contains lexical or syntactic errors, error messages will appear. We have given no instructions related to translation, so nothing happens if all is well. 2.7. Symbol table management The grammar presented thus far can only recognize sc programs. No translation is possible because it does not deal with the semantics Page 10 PCCTS Advanced Tutorial 1.0x of the input program only its structure. To begin semantic interpre- tation, one must add new symbols to a symbol table. The symbols required to "understand" an sc program are VAR Symbol is a local (denoted with the #define constant LOCAL), a parameter (PARAMETER) or global (GLOBAL) variable. FUNC Symbol is a function whose level is always GLOBAL. A symbol table record requires two fields: token which indicates either VAR or FUNC and level which is either LOCAL, PARAMETER or GLO- BAL. The PCCTS system comes with a simple symbol table manager. The source is documented well and its functions will be referenced without explanation here. To use the functions, our grammar must include a file called sym.h and define a symbol table structure. We shall put the #include in the #header directive. #header <<#include "sym.h" #include "charbuf.h">> The file sym.c contains the actual functions and must be linked into your executable. Our makefile will handle this automatically. The sym.c can be found in the support/sym subdirectory of the standard PCCTS installation; this should be copied into the tutorial directory. The symbol table manager requires a number of fields within the symbol table entry. A template has been provided that the user can copy into their work directory and modify. The template in sym.h will be modified in our case to include the two fields mentioned above- token and level. typedef struct symrec { char *symbol; struct symrec *next, *prev, **head, *scope; int token; /* either FUNC or VAR */ int level; /* either LOCAL, GLOBAL, PARAMETER */ int offset; /* offset from sp on the stack */ /* locals are - offset, param is 0 */ /* used only in tut4; reserved */ } Sym, *SymPtr; We add the following definitions to the front of our grammar. Page 11 PCCTS Advanced Tutorial 1.0x << #define HashTableSize 999 #define StringTableSize 5000 #define GLOBAL 0 #define PARAMETER 1 #define LOCAL 2 static Sym *globals = NULL; /* global scope for symbols */ >> where globals is used to track all global symbols (functions and vari- ables). Also, to print out symbol scopes, we define a function called pScope(Sym *p) that dumps a scope to stdout. It's implementation is unimportant and given with the full grammar description listed below. To initialize the symbol table, we add a function call to the symbol table manager library yielding a new main(). main() { zzs_init(HashTableSize, StringTableSize); ANTLR(p(), stdin); } When a variable, parameter or function is defined, we want to add that symbol to the symbol table. We shall treat parameters like vari- ables grammatically and use field level to differentiate between them. When a WORD is found in rule def, we will add it to the symbol table using the scope and level passed into def. The situation is compli- cated slightly by the fact that a local variable may have the same name as a global variable. Scopes are linked lists that weave through the symbol table grouping all entries within the same scope. Our grammar for rule def becomes: def[Sym **scope, int level] : <> (WORD | VAR) ; where the init-action defines a local variable called var. To handle the definition of previously unknown symbols, we add an action after the WORD reference. Page 12 PCCTS Advanced Tutorial 1.0x ( WORD <level = $level; /* set the level to level passed in */ var->token = VAR; /* symbol is a variable */ >> | VAR ) To deal with a symbol defined in another scope we add the following action to the VAR reference. ( WORD <<...>> | VAR <level ) /* make sure we have a diff scope */ { zzs_scope($scope); /* same here as above for unknown */ var = zzs_newadd($1.text); var->level = $level; var->token = VAR; } else printf("redefined variable ignored: %s\n", $1.text); >> ) Note that this implies that the lexical analyzer will modify the token number according to its definition in the symbol table (if any). This is generally not a good idea, but can be quite helpful when trying to remove nasty ambiguities in your grammar. Typically more than one token of lookahead makes it unnecessary to use "derived" tokens. We do so here to illustrate the interaction of lexical analyzer and parser. Rule p must be modified to pass a scope and level to rule def. In addition, we will remove the global scope at the end of parsing and print out the symbols. p : <> ( func | "var" def[&globals, GLOBAL] ";" )* <

> "@" ; Page 13 PCCTS Advanced Tutorial 1.0x Rule func now must create a FUNC symbol table entry and define a VAR entry for its parameter if one exists. Rule func checks for duplicate FUNC entries by only matching unknown WORD's. If a function had been previously defined, its token would be FUNC. func : <> WORD <level = GLOBAL; var->token = FUNC; >> "\(" { def[&locals, PARAMETER] } "\)" "\{" ( "var" def[&locals, LOCAL] ";" )* ( statement )* "\}" <

> ; At the end of the function, we remove the local scope of variables (and parameter if it exists) and print the symbols out to stdout. When a WORD is encountered on the input stream, we need to look it up in the symbol table to find out whether it is a variable (param- eter) or a function. The token number needs to be changed accordingly before the parser sees it so that it will not try to match a WORD. Any WORD references in an expression that are not defined in the sym- bol table are undefined variables. We accomplish this token "deriva- tion" strategy by attaching an action to the regular expression for WORD. #token WORD "[a-zA-Z]+" <<{ Sym *p = zzs_get(LATEXT(1)); if ( p != NULL ) NLA = p->token; }>> The macro LATEXT(1) is always set to the text matched on the input stream for the current token. NLA is the next token of look-ahead. We need to change this from WORD to whatever is found in the symbol table. Rules for statements and expressions do not change when adding symbol table management because they simply apply a structure to Page 14 PCCTS Advanced Tutorial 1.0x grammar symbols and do not introduce new ones. 2.8. Complete ANTLR description (with symbol table management) The following code is the complete ANTLR description to recognize sc programs. Functions, variables and parameters are added to the symbol table and are printed to stdout after function definitions and at the end of the sc program. #header << #include "sym.h" #include "charbuf.h" >> #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> #token STRING "\"(~[\0-\0x1f\"\\]|(\\~[\0-\0x1f]))*\"" <<;>> << #define HashTableSize 999 #define StringTableSize 5000 #define GLOBAL 0 #define PARAMETER 1 #define LOCAL 2 static Sym *globals = NULL; /* global scope for symbols */ main() { zzs_init(HashTableSize, StringTableSize); ANTLR(p(), stdin); } pScope(p) Sym *p; { for (; p!=NULL; p=p->scope) { printf("\tlevel %d | %-12s | %-15s\n", p->level, zztokens[p->token], p->symbol); } } >> Page 15 PCCTS Advanced Tutorial 1.0x p : <> ( func | "var" def[&globals, GLOBAL] ";" )* <

> "@" ; def[Sym **scope, int level] : <> ( WORD <level = $level; var->token = VAR; >> | VAR <level ) { zzs_scope($scope); var = zzs_newadd($1.text); var->level = $level; var->token = VAR; } else printf("redefined variable ignored: %s\n", $1.text); >> ) ; Page 16 PCCTS Advanced Tutorial 1.0x func : <> WORD <level = GLOBAL; var->token = FUNC; >> "\(" { def[&locals, PARAMETER] } "\)" "\{" ( "var" def[&locals, LOCAL] ";" )* ( statement )* "\}" <

> ; statement : expr ";" | "\{" ( statement )* "\}" | "if" "\(" expr "\)" statement {"else" statement} | "while" "\(" expr "\)" statement | "return" expr ";" | "print" expr ";" ; Page 17 PCCTS Advanced Tutorial 1.0x expr : VAR "=" expr | expr0 ; expr0 : expr1 ( ( "==" | "!=" ) expr1 )* ; expr1 : expr2 ( ( "\+" | "\-" ) expr2 )* ; expr2 : expr3 ( ( "\*" | "/" ) expr3 )* ; expr3 : {"\-" } expr4 ; expr4 : STRING | VAR | ( FUNC | WORD ) "\(" { expr } "\)" | "\(" expr "\)" ; #token WORD "[a-zA-Z]+" <<{ Sym *p = zzs_get(LATEXT(1)); if ( p != NULL ) NLA = p->token; }>> 2.9. File sym.h The following is a modification of the sym.h template provided with PCCTS. The fields of the symbol table entry structure have aug- mented for our purposes as outlined above. Page 18 PCCTS Advanced Tutorial 1.0x /* define some hash function */ #ifndef HASH #define HASH(p, h) while ( *p != '\0' ) h = (h<<1) + *p++; #endif typedef struct symrec { char * symbol; struct symrec *next, *prev, **head, *scope; unsigned hash; int token; /* either FUNC or VAR */ int level; /* either LOCAL, GLOBAL, PARAMETER */ int offset; /* offset from sp on the stack */ /* locals are - offset, param is 0 */ /* used only tut4; reserved */ } Sym, *SymPtr; void zzs_init(); void zzs_done(); void zzs_add(); Sym *zzs_get(); void zzs_del(); void zzs_keydel(); Sym **zzs_scope(); Sym *zzs_rmscope(); void zzs_stat(); Sym *zzs_new(); Sym *zzs_newadd(); char *zzs_strdup(); 2.10. Makefile (for use with symbol table management) The following makefile can be used to make the above language description into an executable file. We assume that the ANTLR includes and standard attribute packages are located in a directory accessible to us as current working directory. The grammar listed above must be in file tut2.g. Page 19 PCCTS Advanced Tutorial 1.0x # # Makefile for 1.00 tutorial # ANTLR creates parser.dlg, err.c, tut1.c # DLG creates scan.c # CFLAGS= -I../h GRM=tut2 SRC=scan.c $(GRM).c err.c sym.c OBJ=scan.o $(GRM).o err.o sym.o tutorial: $(OBJ) $(SRC) cc -o $(GRM) $(OBJ) $(GRM).c parser.dlg : $(GRM).g antlr -k 2 $(GRM).g scan.c : parser.dlg dlg -C2 parser.dlg scan.c 2.11. Sample input/output The current state of the program accepts input like the following sample. var i; var j; f(k) { var local; var j; if ( "true" ) local = "zippo"; } g() { var note; note = "1"; while ( note ) { i = "456"; } } The output of our executable, tut, would be Page 20 PCCTS Advanced Tutorial 1.0x Locals for f: level 2 | VAR | j level 2 | VAR | local level 1 | VAR | k Locals for g: level 2 | VAR | note Globals: level 0 | FUNC | g level 0 | FUNC | f level 0 | VAR | j level 0 | VAR | i Note that the parameter k is level 1 for PARAMETER and the local vari- able local is level 2 for LOCAL. 3. Translate sc to stack code Generating code for a stack machine is simple and can be done by simply adding printf() actions to the grammar in the appropriate places. We begin with a discussion of the stack machine and how to gen- erate code for it. Next we augment our grammar with actions to dump stack code to stdout. 3.1. A simple stack machine for sc Our stack machine consists of a single CPU, a finite stack of strings, a finite memory, a stack pointer (sp) and a frame pointer (fp). All data items used by stack programs are strings (currently set to a maximum length of 100). Our string stack grows downwards towards 0 from the maximum stack height. To make implementation simple, our stack code will actually be a sequence of macro invocations in a C program. This way C will take care of control-flow and allocating space etc... The minimum stack code program defines a _main and includes sc.h which contains all of the stack code macros: #include "sc.h" _main() { BEGIN; END; } The BEGIN and END macros are explained below. Page 21 PCCTS Advanced Tutorial 1.0x 3.1.1. Functions Every sc program requires a main which will we translate to _main (a C function main will call _main). Other functions are translated verbatim. The parameter to your main program will be the first com- mand line argument when you execute your sc program (after translation to C). If no command line argument is specified, a "" is pushed. Parameters are pushed on the string stack before a function is called, so no argument is needed to the resulting C function. Return values are implicitly strings but are also returned on the string stack - avoiding the need to define a return type of the C function. An "instruction" (macro) called BEGIN is executed before any other in a function. This saves the current frame pointer and then makes it point to the argument passed into the function. END is used to restore the frame pointer to its previous value and make the stack pointer point to the return value. After a function call, the top of stack is always the return value. Arguments and local variables are at offsets to the frame pointer. The optional argument is at offset 0. The first local vari- able is at offset 1, the second at offset 2 and so on. Graphically, | . | | . | | arg | fp + 0 | 1st local | fp - 1 | 2nd local | fp - 2 | ... | | . | | . | A dummy argument is pushed by the calling function if no argument is specified. To translate a function, f(arg) { } we simply dump the following Page 22 PCCTS Advanced Tutorial 1.0x f() { BEGIN; END; } Note that the argument disappears because we pass arguments on the string stack not the C hardware stack. 3.1.2. Variables Global variables of the form var name; are translated to SCVAR name; where SCVAR is a C type defined in sc.h that makes space for a sc variable. Local variables are allocated on the string stack and are created at run time via the instruction LOCAL. Each execution of LOCAL allo- cates space for one more local variable on the stack. LOCAL's must be executed after the BEGIN but before any other stack code instruction. The END macro resets the stack pointer so that these locals disappear after each function invocation. For example, main() { var i; var j; } is translated to: Page 23 PCCTS Advanced Tutorial 1.0x #include "sc.h" _main() { BEGIN; LOCAL; LOCAL; END; } 3.1.3. Expressions Operator precedence is defined implicitly in top-down (LL) gram- mars. The deeper the level of recursion, the higher the precedence. To generate code for operators, one prints out the correct macro for that operator after the two operators have been seen (because we are generating code for a stack machine). If the operator is a unary operator like negate, we wait until the operand is seen and then print the operator. For instance, expr1, expr1 : expr2 ( ("\+"|"\-") expr2)* ; would be translated as: expr1 : <> expr2 ( ( "\+" <> | "\-" <> ) expr2 <> )* ; where ADD and SUB are sc stack machine macros defined below. The assignment operator groups right to left and must generate code that duplicates itself after each assignment so that the value of the expression is on the stack for any chained assignement. For exam- ple, Page 24 PCCTS Advanced Tutorial 1.0x main() { var a; var b; a = b = "test"; } would be translated to: #include "sc.h" _main() /* main() */ { BEGIN; LOCAL; /* var a; */ LOCAL; /* var b; */ SPUSH("test"); /* a = b = "test"; */ DUP; LSTORE(2); /* store into b */ DUP; LSTORE(1); /* store into a */ POP; END; } Since an expression is a statement, it is possible that a value could be left on the stack. For example, main() { "expression"; } We must pop this extraneous value off the stack: #include "sc.h" _main() { BEGIN; SPUSH("expression"); POP; END; } Page 25 PCCTS Advanced Tutorial 1.0x 3.1.4. Instructions All instructions take operands from the stack and return results on the stack. Some have no side effects on the stack but alter the C program counter. PUSH(v) Push the value of variable v onto the stack. SPUSH(s) Push the value of the string constant s onto the stack. LPUSH(n) Push the value of the local variable or function argument at offset n onto the stack. STORE(v) Pop the top of stack and store it into variable v. LSTORE(n) Pop the top of stack and store it into local variable or function argument at offset n. POP Pop the top of stack; stack is one string smaller. No side effects with data memory. LOCAL Create space on the stack for a local variable. Must appear after BEGIN but before any other instruction in a function. BRF(a) Branch, if top of stack is "false" or "0", to "address" a (C label); top of stack is popped off. BRT(a) Branch, if top of stack is "true" or "1", to "address" a (C label); top of stack is popped off. BR(a) Branch to "address" a (C label). Stack is not touched. CALL(f) Call the function f. Returns with result on stack top. PRINT Pop the top of stack and print the string to stdout. RETURN Set the return value to the top of stack. Return from the function. EQ Perform stack[sp] = (stack[sp+1] == stack[sp]). NEQ Perform stack[sp] = (stack[sp+1] != stack[sp]). ADD Perform stack[sp] = (stack[sp+1] + stack[sp]). SUB Perform stack[sp] = (stack[sp+1] - stack[sp]). MUL Perform stack[sp] = (stack[sp+1] * stack[sp]). DIV Perform stack[sp] = (stack[sp+1] / stack[sp]). NEG Perform stack[sp] = -stack[sp]. Page 26 PCCTS Advanced Tutorial 1.0x DUP Perform PUSH(stack[sp]). BEGIN Function preamble. END Function cleanup. Boolean operations yield string constants "false" for false and true for true. All other operations yield string constants represent- ing numerical values ("3"+"4" is "7"). 3.2. Examples The sample factorial program from above, main(n) { print fact(n); print "\n"; } fact(n) { if ( n == "0" ) return "1"; return n * fact(n - "1"); } would be translated to: #include "sc.h" _main() /* main(n) */ { /* { */ BEGIN; LPUSH(0); /* print fact(n); */ CALL(fact); PRINT; SPUSH("\n"); /* print "\n"; */ PRINT; END; /* } */ } Page 27 PCCTS Advanced Tutorial 1.0x fact() /* fact(n) */ { BEGIN; LPUSH(0); /* if ( n == "0" ) */ SPUSH("0"); EQ; BRF(iflabel0); SPUSH("1"); /* return "1"; */ RETURN; iflabel0: ; LPUSH(0); /* return n * fact(n - "1"); */ LPUSH(0); SPUSH("1"); SUB; CALL(fact); MUL; RETURN; END; /* } */ } 3.3. Augmenting grammar to dump stack code In order to generate code, we must track the offset of local variables. To do so, we add a field, offset, to our symbol table record: typedef struct symrec { char * symbol; struct symrec *next, *prev, **head, *scope; unsigned hash; int token; /* either FUNC or VAR */ int level; /* either LOCAL, GLOBAL, PARAMETER */ int offset; /* offset from sp on the stack */ /* locals are negative offsets, param is 0 */ } Sym, *SymPtr; We begin modifying our grammar by making a global variable that tracks the offset of local variables and defining a variable to track label numbers: static int current_local_var_offset, LabelNum=0; Because the translation of all programs must include the sc stack machine definitions, we add a printf to rule p: Page 28 PCCTS Advanced Tutorial 1.0x p : <> ( func | "var" def[&globals, GLOBAL] ";" )* << p = zzs_rmscope(&globals); >> "@" ; In order to generate the variable definition macros and update the symbol table, we modify rule def as follows: def[Sym **scope, int level] : <> ( WORD <level = $level; var->token = VAR; var->offset = current_local_var_offset++; switch(var->level) { case GLOBAL: printf("SCVAR %s;\n", $1.text); break; case LOCAL : printf("\tLOCAL;\n"); break; } >> | VAR <level ) { zzs_scope($scope); var = zzs_newadd($1.text); var->level = $level; var->token = VAR; var->offset = current_local_var_offset++; switch(var-> level) { case GLOBAL: printf("\tSCVAR %s;\n",$1.text);break; case LOCAL : printf("\tLOCAL;\n"); break; } } else printf("redefined variable ignored: %s\n",$1.text); >> ) ; The function definition rule must now dump a function template to stdout and generate the BEGIN and END macros. func must also update current_local_var_offset. Note that the code to dump symbols is gone. Page 29 PCCTS Advanced Tutorial 1.0x func : <> WORD <level = GLOBAL; var->token = FUNC; if (strcmp("main",$1.text)) { printf("%s()\n",$1.text); } else printf("_main()\n"); >> "\(" ( def[&locals, PARAMETER] | <> ) "\)" "\{" << printf("{\n\tBEGIN;\n"); >> ( "var" def[&locals, LOCAL] ";" )* ( statement )* "\}" << printf("\tEND;\n}\n"); >> << p = zzs_rmscope(&locals); >> ; Statements are easy to handle since expr generates most of the code. statement : <> expr ";" <> | "\{" ( statement )* "\}" | "if" "\(" expr "\)" << n = LabelNum++; printf("\tBRF(iflabel%d);\n",n); >> statement << printf("iflabel%d: ;\n",n); >> {"else" statement} | "while" << n = LabelNum++; printf("wbegin%d: ;\n", n); >> "\(" expr "\)" << printf("\tBRF(wend%d);\n",n); >> statement << printf("\tBR(wbegin%d);\n", n); >> << printf("wend%d: ;\n",n); >> << n++; >> | "return" expr ";" << printf("\tRETURN;\n"); >> | "print" expr ";" << printf("\tPRINT;\n"); >> ; 3.4. Full translator The following grammar accepts programs in sc and translates them into stack code. Page 30 PCCTS Advanced Tutorial 1.0x #header <<#include "sym.h" #include "charbuf.h" >> #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> #token STRING "\"(~[\0-\0x1f\"\\]|(\\~[\0-\0x1f]))*\"" <<;>> << #define HashTableSize 999 #define StringTableSize 5000 #define GLOBAL 0 #define PARAMETER 1 #define LOCAL 2 static Sym *globals = NULL; /* global scope for symbols */ static int current_local_var_offset, LabelNum=0; main() { zzs_init(HashTableSize, StringTableSize); ANTLR(p(), stdin); } pScope(p) Sym *p; { for (; p!=NULL; p=p->scope) { printf("\tlevel %d | %-12s | %-15s\n", p->level, zztokens[p->token], p->symbol); } } >> Page 31 PCCTS Advanced Tutorial 1.0x p : <> ( func | "var" def[&globals, GLOBAL] ";" )* << p = zzs_rmscope(&globals); >> "@" ; def[Sym **scope, int level] : <> ( WORD <level = $level; var->token = VAR; var->offset = current_local_var_offset++; switch(var->level) { case GLOBAL: printf("SCVAR %s;\n", $1.text); break; case LOCAL : printf("\tLOCAL;\n"); break; } >> | VAR <level ) { zzs_scope($scope); var = zzs_newadd($1.text); var->level = $level; var->token = VAR; var->offset = current_local_var_offset++; switch(var-> level) { case GLOBAL: printf("\tSCVAR %s;\n", $1.text); break; case LOCAL : printf("\tLOCAL;\n"); break; } } else printf("redefined variable ignored: %s\n", $1.text); >> ) ; Page 32 PCCTS Advanced Tutorial 1.0x func : <> WORD <level = GLOBAL; var->token = FUNC; if (strcmp("main",$1.text)) { printf("%s()\n",$1.text); } else printf("_main()\n"); >> "\(" ( def[&locals, PARAMETER] | <> ) "\)" "\{" << printf("{\n\tBEGIN;\n"); >> ( "var" def[&locals, LOCAL] ";" )* ( statement )* "\}" << printf("\tEND;\n}\n"); >> << p = zzs_rmscope(&locals); >> ; statement : <> expr ";" <> | "\{" ( statement )* "\}" | "if" "\(" expr "\)" << n = LabelNum++; printf("\tBRF(iflabel%d);\n",n); >> statement << printf("iflabel%d: ;\n",n); >> {"else" statement} | "while" << n = LabelNum++; printf("wbegin%d: ;\n", n); >> "\(" expr "\)" << printf("\tBRF(wend%d);\n",n); >> statement << printf("\tBR(wbegin%d);\n", n); >> << printf("wend%d: ;\n",n); >> << n++; >> | "return" expr ";" << printf("\tRETURN;\n"); >> | "print" expr ";" << printf("\tPRINT;\n"); >> ; Page 33 PCCTS Advanced Tutorial 1.0x expr : <> VAR "=" expr << printf("\tDUP;\n"); >> <level != GLOBAL ) printf("\tLSTORE(%d);\n", s->offset); else printf("\tSTORE(%s);\n", s->symbol); >> | expr0 ; expr0 : <> expr1 ( ( "==" <> | "!=" <> ) expr1 <> )* ; expr1 : <> expr2 ( ( "\+" <> | "\-" <> ) expr2 <> )* ; expr2 : <> expr3 ( ( "\*" <> | "/" <> ) expr3 <> )* ; Page 34 PCCTS Advanced Tutorial 1.0x expr3 : <> {"\-" <>} expr4 <> ; expr4 : <> STRING << printf("\tSPUSH(%s);\n", $1.text); >> | VAR << s = zzs_get($1.text); if ( s->level != GLOBAL ) printf("\tLPUSH(%d);\n", s->offset); else printf("\tPUSH(%s);\n", s->symbol); >> | ( FUNC <<$0=$1;>> | WORD <<$0=$1;>> ) "\(" { <> expr <> } "\)" <> | "\(" expr "\)" ; #token WORD "[a-zA-Z]+" <<{ Sym *p = zzs_get(zzlextext); if ( p != NULL ) NLA = p->token; }>> 3.5. Makefile for Full Translator The makefile is no different from the one used for the symbol table management version of the tutorial except that we need to change the GRM make variable as follows: GRM=tut3 which implies that tut3.g is the file containing the third revision of our sc translator. 3.6. Use of translator Once the translator has been made using the makefile, it is ready to translate sc programs. To translate and execute the factorial example from above, we execute the following: tut3 < fact.c > temp.c Page 35 PCCTS Advanced Tutorial 1.0x where fact.c is the file containing the factorial code. temp.c will contain the translated program. It is valid C code because it is nothing more than a bunch of macro invocations (the macros are defined in sc.h). We can compile temp.c like any other C program: cc temp.c To execute the program, we type: a.out n where n is the number you want the factorial of. Typing: a.out 8 yields: 40320.000000 3.7. Stack code macros - sc.h The following file is included by any C program using the stack code instructions outlined above. Page 36 PCCTS Advanced Tutorial 1.0x /* sc.h -- string C stack machine macros * * For use with PCCTS advanced tutorial version 1.0x */ #include #include #include /* * The function invocation stack looks like: * * | . | * | . | * | arg | fp + 0 arg is "" if none specified * | 1st local | fp - 1 * | 2nd local | fp - 2 * | ... | * | . | * | . | */ #define STR_SIZE 100 #define STK_SIZE 200 /* define stack */ typedef struct { char text[STR_SIZE]; } SCVAR; static SCVAR stack[STK_SIZE]; static int sp = STK_SIZE, fp; /* number begins with number or '.' followed by number. All numbers * are converted to floats before comparison. */ #define SCnumber(a) (isdigit(a[0]) || (a[0]=='.' && isdigit(a[1]))) Page 37 PCCTS Advanced Tutorial 1.0x #define TOS stack[sp].text #define NTOS stack[sp+1].text #define TOSTRUE ((strcmp(TOS, "true")==0)||(strcmp(TOS, "1")==0) \ ||(SCnumber(TOS)&&atof(TOS)==1.0) ) #define TOSFALSE ((strcmp(TOS, "false")==0)||(strcmp(TOS, "0")==0) \ ||(SCnumber(TOS)&&atof(TOS)==0.0) ) #define PUSH(a) {if ( sp==0 ) {fprintf(stderr, "stk ovf!\n"); exit(-1);} \ strcpy(stack[--sp].text, (a).text);} #define SPUSH(a) {if ( sp==0 ) {fprintf(stderr, "stk ovf!\n"); exit(-1);} \ strcpy(stack[--sp].text, a);} #define LPUSH(a) {if ( sp==0 ) {fprintf(stderr, "stk ovf!\n"); exit(-1);} \ stack[--sp] = stack[fp-a];} #define CALL(f) {f();} #define POP stack[sp++] #define LOCAL {if ( sp==0 ) {fprintf(stderr, "stk ovf!\n"); exit(-1);} \ stack[--sp].text[0] = '\0';} #define BRF(lab) if ( TOSFALSE ) {POP; goto lab;} else POP; #define BRT(lab) if ( TOSTRUE ) {POP; goto lab;} else POP; #define BR(lab) goto lab #define PRINT printf("%s", POP.text) #define RETURN {strcpy(stack[fp].text, TOS); END; return;} #define STORE(v) {v = POP;} #define LSTORE(off) {stack[fp-off] = POP;} Page 38 PCCTS Advanced Tutorial 1.0x /* operators */ #define EQ {char *a,*b; float c,d; \ a = POP.text; b = POP.text; \ if ( SCnumber(a) && SCnumber(b) ) { \ c=atof(a); d=atof(b); \ if ( c == d ) {SPUSH("true");} \ else SPUSH("false"); \ } \ else if ( strcmp(a, b)==0 ) {SPUSH("true");}\ else SPUSH("false");} #define NEQ {SCVAR a,b; float c,d; \ a = POP.text; b = POP.text; \ if ( SCnumber(a) && SCnumber(b) ) { \ c=atof(a); d=atof(b); \ if ( c != d ) {SPUSH("true");} \ else SPUSH("false"); \ } \ else if ( strcmp(a, b)!=0 ) {SPUSH("true");}\ else SPUSH("false");} #define ADD {SCVAR c; float a,b; \ if ( !SCnumber(NTOS) || !SCnumber(TOS) ) { \ strncat(NTOS, TOS, STR_SIZE - strlen(NTOS));\ sp++; \ } else { \ a=atof(POP.text); b=atof(POP.text);\ sprintf(c.text, "%f", a+b); PUSH(c);\ }} #define SUB {SCVAR c; float a,b; a=atof(POP.text); b=atof(POP.text); \ sprintf(c.text, "%f", b-a); PUSH(c);} #define MUL {SCVAR c; float a,b; a=atof(POP.text); b=atof(POP.text); \ sprintf(c.text, "%f", a*b); PUSH(c);} #define DIV {SCVAR c; float a,b; a=atof(POP.text); b=atof(POP.text); \ sprintf(c.text, "%f", b/a); PUSH(c);} #define NEG {SCVAR c; float a; a=atof(POP.text); \ sprintf(c.text, "%f", -a); PUSH(c);} #define DUP {if ( sp==0 ) {fprintf(stderr, "stk ovf!\n"); exit(-1);} \ stack[sp-1] = stack[sp]; --sp;} Page 39 PCCTS Advanced Tutorial 1.0x #define BEGIN int save_fp = fp; fp = sp #define END sp = fp; fp = save_fp; main(argc, argv) int argc; char *argv[]; { if ( argc == 2 ) {SPUSH(argv[1]);} else SPUSH(""); CALL(_main); POP; } 4. Intermediate form construction This section describes how trees can be used as an intermediate form of the sc source program and how our tut2.g symbol table manage- ment example can be modified to construct these trees. Example tree constructions are given as well as the complete source. Note that this section and the code generation section essen- tially duplicate the translation achieved in the previous section on stack code. We arrive at the solution from a different angle, how- ever. This method is much more general and would be used for "real" grammars. 4.1. Tree Construction To construct an intermediate form (IF) representation of an sc program, we will construct trees using the abstract syntax tree (AST) mechanism provided with PCCTS. PCCTS supports an automatic and an explicit tree construction mechanism; we will use both. We will aug- ment the grammar that has the symbol table management actions - tut2.g. In order to use the AST mechanism within PCCTS, we must do the following: o Define the AST fields, AST_FIELDS, needed by the user. o Define the default AST node constructor, zzcr_ast(), that con- verts from a lexeme to an AST node. o Define an explicit AST node constructor - zzmk_ast() (because we use the explicit tree constructor mechanism also). Our tree nodes will contain a token number to identify the node. This token number can also be a value for which there is no corresponding lexical item (e.g. a function call may have a node called DefineVar). If the token indicates that the node represents a variable or function, we must have information as to its level, offset Page 40 PCCTS Advanced Tutorial 1.0x from the frame pointer (if a local or parameter) and the string representing the name of the item. This information can all be held via: #define AST_FIELDS int token, level, offset; char str[D_TextSize]; which will be added to the #header PCCTS pseudo-op action. Note that D_TextSize is defined in charbuf.h. To tell PCCTS how it should build AST nodes for use with the automatic tree construction mechanism, we define the following macro in the #header action: #define zzcr_ast(ast,attr,tok,text) create_ast(ast,attr,tok,text) which merely calls a function we define as follows: create_ast(ast,attr,tok,text) AST *ast; Attrib *attr; int tok; char *text; { Sym *s; ast->token = tok; if ( tok == VAR ) { s = zzs_get(text); ast->level = s->level; ast->offset = s->offset; } if ( tok == STRING || tok == VAR || tok == FUNC ) strcpy(ast->str, text); } Because of the symbol table lookup etc... we make the node constructor a function rather than have the code replicated at each invocation of the zzcr_ast macro. When we explicitly construct trees, the automatic node construc- tor is generally disabled and we must explicitly call a function to create an AST node: Page 41 PCCTS Advanced Tutorial 1.0x AST * zzmk_ast(node, token, str) AST *node; int token; char *str; { Sym *s; node->token = token; if ( token == VAR ) { s = zzs_get(str); node->level = s->level; node->offset = s->offset; } if ( tok == STRING || tok == VAR || tok == FUNC ) strcpy(node->str, str); return node; } This function will be invoked when we have a grammar action with a reference of the form #[token, string]. In order to print out the trees that we have defined, let us define a simple function to do a preorder, lisp-like walk of our trees: lisp(tree) AST *tree; { while ( tree!= NULL ) { if ( tree->down != NULL ) printf(" ("); if ( tree->token == STRING || tree->token == VAR || tree->token == FUNC ) printf(" %s", tree->str); else printf(" %s", zztokens[tree->token]); lisp(tree->down); if ( tree->down != NULL ) printf(" )"); tree = tree->right; } } PCCTS automatically assumes that all rules will produce trees and that all terminals within rules will result in nodes added to the current tree. In the case of rules p, expr4, statement and func, this is not true. We will explicitly handle the trees within these rules. Therefore, we append a ! after the definition of these rules. We need to decide how each sc construct will be translated to a subtree and Page 42 PCCTS Advanced Tutorial 1.0x when we will pass these trees off to the code generator. We will col- lect all subtrees within a function and then, just before we remove the symbols for that function, we will pass this tree off to lisp to be printed; we will pass it off to the code generator when we define it in the next section. After it has been printed, the function trees will be destroyed. For global variable definitions, we will pass each one individually off to lisp and then destroy the tree. Before modifying the rules of the grammar, we must define labels for a number of regular expressions so that the token value of that expression can be referenced from within an action. We also define labels for tokens which do not exist physically in a program, but are needed for code generation. /* Not actual terminals, just node identifiers */ #token DefineFunc #token SLIST /* Define tokens for code generation */ #token DefineVar "var" #token Mul " #token Div "/" #token Add "+" #token Sub "-" #token Equal "==" #token NotEqual "!=" #token If "if" #token While "while" #token Return "return" #token Print "print" #token Assign "=" All PCCTS grammars that use the AST mechanism need to pass the address of a NULL pointer to the starting rule. main() { AST *root=NULL; zzs_init(HashTableSize, StringTableSize); ANTLR(p(&root), stdin); } Rule p is modified as follows: Page 43 PCCTS Advanced Tutorial 1.0x p! : <> ( func | "var" def[&globals, GLOBAL] ";" <> )* <

> "@" ; The #[DefineVar] reference calls our zzmk_ast macro to create an AST node whose token is set to DefineVar. The reference to #(#[DefineVar], #2) is a tree constructor which makes the DefineVar node the parent of the tree returned by rule def - #2. In general, the tree constructor has the form: #( root, sibling1, sibling2, ..., siblingN ) where any NULL pointer terminates the list except for the root pointer. #2 refers to the second AST node or tree within an alternative - def in our case. Rule def itself does not change because PCCTS automatically creates a node for the WORD or VAR and passes it back to the invoking rule. Trees for functions are constructed in the following form: DefineFunc | v FUNC -> DefineVar -> DefineVar -> SLIST | | | v | v optional_arg | stat1 -> ... > statn v local1 -> ... -> localn where DefineFunc and DefineVar are dummy tokens used only in trees (as opposed to the grammar). Rule func is modified to build these trees: Page 44 PCCTS Advanced Tutorial 1.0x func! : <> WORD <level = GLOBAL; var->token = FUNC; >> "\(" ( def[&locals, PARAMETER] <> | <> ) "\)" "\{" ( "var" def[&locals, LOCAL] <> ";" )* ( statement <> )* "\}" <> ; Variable parm is set to the AST node created by def if a parameter is found else it is NULL. Variable v is used to maintain a list of local variables. The tree constructor v = #(NULL,v,#2); says to make a new tree with no root and with siblings v and #2 - which effectively links #2 to the end of the current list of vari- ables. Variable s behaves in a similar fashion. After the list of variables and statements have been accumulated, we add a dummy node (DefineVar, SLIST) as a root to each list so that we get the structure given above. To examine the trees we have constructed, we make a call to lisp and then destroy the tree. Even though it is not necessary to do so, we will build trees for statement explicitly (in general, the automatic mechanism is best for expressions with simple operator/operand relationships). Page 45 PCCTS Advanced Tutorial 1.0x statement!: <> expr ";" <<#0 = #1;>> | "\{" ( statement <> )* "\}" <<#0 = #(#[SLIST], s);>> | "if" "\(" expr "\)" statement {"else" statement <>} <<#0 = #(#[If], #3, #5,el);>> | "while" "\(" expr "\)" statement <<#0 = #(#[While], #3,#5);>> | "return" expr ";" <<#0 = #(#[Return], #2);>> | "print" expr ";" <<#0 = #(#[Print], #2);>> ; Many of the tokens in statement are not included in the tree because they are a notational convenience for humans or are used for grouping purposes. As before, we track lists of statements using the tree con- structor: #(NULL,s,#1). The last unusual rule is expr: expr4! : <> STRING <<#0 = #[STRING, $1.text];>> | VAR <<#0 = #[STRING, $1.text];>> | ( FUNC <> | WORD <> ) "\(" { expr <> } "\)" <<#0 = #(f, arg);>> | "\(" expr "\)" <<#0 = #2;>> ; Assigning a tree to #0 makes that tree the return tree. However, it should really only be used in rules that do not use the automatic tree construction mechanism; i.e. there is a ! after the rule definition. For a function call, we make a tree with a FUNC node at the root and the optional argument as the child. The third alternative merely returns what is computed by expr; the parenthesis are mechanism for syntactic precedence and add nothing to the tree. The tree structure itself embodies the precedence. The only other required modifications to the grammar are to indi- cate which tokens are operators (subtree roots) and which are operands in the expression rules. To indicate that a token is to be made a subtree root, we append a ^. All other tokens are considered operands (children). Appending a token with ! indicates that it is not to be included in the tree. For example, to make trees like: Page 46 PCCTS Advanced Tutorial 1.0x operator | v opnd1 -> opnd2 we modify the operators in expr through expr3 in a fashion similar to: expr1 : expr2 ( ( "\+"^ | "\-"^ ) expr2 )* ; and expr3 : { "\-"^ } expr4 ; 4.2. Full Grammar to Construct Trees The following grammar constructs AST's as an intermediate form (IF) of an sc program and prints out the IF in lisp-like preorder. #header <<#include "sym.h" #include "charbuf.h" #define AST_FIELDS int token, level, offset; char str[D_TextSize]; #define zzcr_ast(ast,attr,tok,text) create_ast(ast,attr,tok,text) #define DONT_CARE 0 #define SIDE_EFFECTS 1 #define VALUE 2 >> #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> #token STRING "\"(~[\0-\0x1f\"\\]|(\\~[\0-\0x1f]))*\"" <<;>> /* Not actual terminals, just node identifiers */ #token DefineFunc #token SLIST Page 47 PCCTS Advanced Tutorial 1.0x /* Define tokens for code generation */ #token DefineVar "var" #token Mul "\*" #token Div "/" #token Add "\+" #token Sub "\-" #token Equal "==" #token NotEqual "!=" #token If "if" #token While "while" #token Return "return" #token Print "print" #token Assign "=" <<;>> << #define HashTableSize 999 #define StringTableSize 5000 #define GLOBAL 0 #define PARAMETER 1 #define LOCAL 2 static Sym *globals = NULL; /* global scope for symbols */ static int current_local_var_offset=0; create_ast(ast,attr,tok,text) AST *ast; Attrib *attr; int tok; char *text; { Sym *s; ast->token = tok; if ( tok == VAR ) { s = zzs_get(text); ast->level = s->level; ast->offset = s->offset; } if ( tok == STRING || tok == VAR || tok == FUNC ) strcpy(ast->str, text); } Page 48 PCCTS Advanced Tutorial 1.0x AST * zzmk_ast(node, token, str) AST *node; int token; char *str; { Sym *s; node->token = token; if ( token == VAR ) { s = zzs_get(str); node->level = s->level; node->offset = s->offset; } if ( token == STRING || token == VAR || token == FUNC ) strcpy(node->str, str); return node; } lisp(tree) AST *tree; { while ( tree!= NULL ) { if ( tree->down != NULL ) printf(" ("); if ( tree->token == STRING || tree->token == VAR || tree->token == FUNC ) printf(" %s", tree->str); else printf(" %s", zztokens[tree->token]); lisp(tree->down); if ( tree->down != NULL ) printf(" )"); tree = tree->right; } } main() { AST *root=NULL; zzs_init(HashTableSize, StringTableSize); ANTLR(p(&root), stdin); } Page 49 PCCTS Advanced Tutorial 1.0x pScope(p) Sym *p; { for (; p!=NULL; p=p->scope) { printf("\tlevel %d | %-12s | %-15s\n", p->level, zztokens[p->token], p->symbol); } } >> p! : <> ( func | "var" def[&globals, GLOBAL] ";" <> )* <

> "@" ; Page 50 PCCTS Advanced Tutorial 1.0x def[Sym **scope, int level] : <> ( WORD <level = $level; var->token = VAR; var->offset = current_local_var_offset++; >> | VAR <level ) { zzs_scope($scope); var = zzs_newadd($1.text); var->level = $level; var->token = VAR; var->offset = current_local_var_offset++; } else printf("redefined variable ignored: %s\n", $1.text); >> ) ; func! : <> WORD <level = GLOBAL; var->token = FUNC; >> "\(" ( def[&locals, PARAMETER] <> | <> ) "\)" "\{" ( "var" def[&locals, LOCAL] <> ";" )* ( statement <> )* "\}" <> ; Page 51 PCCTS Advanced Tutorial 1.0x statement!: <> expr ";" <<#0 = #1;>> | "\{" ( statement <> )* "\}" <<#0 = #(#[SLIST], s);>> | "if" "\(" expr "\)" statement {"else" statement <>} <<#0 = #(#[If], #3, #5, el);>> | "while" "\(" expr "\)" statement <<#0 = #(#[While], #3, #5);>> | "return" expr ";" <<#0 = #(#[Return], #2);>> | "print" expr ";" <<#0 = #(#[Print], #2);>> ; expr : VAR "="^ expr | expr0 ; expr0 : expr1 ( ( "=="^ | "!="^ ) expr1 )* ; expr1 : expr2 ( ( "\+"^ | "\-"^ ) expr2 )* ; expr2 : expr3 ( ( "\*"^ | "/"^ ) expr3 )* ; Page 52 PCCTS Advanced Tutorial 1.0x expr3 : { "\-"^ } expr4 ; expr4! : <> STRING <<#0 = #[STRING, $1.text];>> | VAR <<#0 = #[VAR, $1.text];>> | ( FUNC <> | WORD <> ) "\(" { expr <> } "\)" <<#0 = #(f, arg);>> | "\(" expr "\)" <<#0 = #2;>> ; #token WORD "[a-zA-Z]+" <<{ Sym *p = zzs_get(LATEXT(1)); if ( p != NULL ) NLA = p->token; }>> 4.3. Example translations To illustrate how functions get translated, consider the follow- ing sc program: var a; main(n) { var b; } Our translator would create trees represented by: ( DefineVar WORD ) ( DefineFunc FUNC ( DefineVar WORD ) ( DefineVar WORD ) SLIST ) where WORD represents the variables found on the input stream. SLIST has no child because there were no statements in the function. Some example statement transformations follow. if: Page 53 PCCTS Advanced Tutorial 1.0x if ( b == "hello" ) print n; translates to ( If ( Equal b "hello" ) ( Print n ) ) while: while ( i != "10" ) { print i; i = i+"1"; } translates to ( While ( NotEqual i "10" ) ( SLIST ( Print i ) ( = i ( Add i "1" ) ) ) ) Expressions have the operator at the root and the operands as children. For example, i = a + "3" * "9" + f("jim"); would be translated to: ( = i ( Add ( Add a ( Mul "3" "9" ) ) ( f "jim" ) ) ) which looks like: Page 54 PCCTS Advanced Tutorial 1.0x = | v i -> Add | v Add --------------> f | | v v a -> Mul "jim" | v "3" -> "9" 5. Code generation from intermediate form To translate the intermediate form (IF) to the stack code described above, we present the following simple little code genera- tor. We use no devious C programming to make this fast or nifty because this document concentrates on how PCCTS grammars are developed not on how to write spectacular C code. 5.1. Modification of tree construction grammar To convert from the translator above that printed out trees in lisp format, we add the following definitions needed as arguments to gen() in the #header action. #define DONT_CARE 0 #define SIDE_EFFECTS 1 #define VALUE 2 Also, we replace all calls to lisp(tree) with gen(tree, DONT_CARE). 5.2. Code generator Function gen() presented below accepts an AST as an argument and walks the tree generating output when necessary. You will notice all the header information needed by this file, code.c. This illustrates one of the minor inconveniences of PCCTS. Any C file that needs to access PCCTS variables, rules, etc... must include all of the things that PCCTS generates at the head of all PCCTS generated files. The code generator accepts an evaluation mode argument so that it can remove some unnecessary code. For example, Page 55 PCCTS Advanced Tutorial 1.0x main() { "3" + f(); } f() {} would result in #include "sc.h" _main() { BEGIN; CALL(f); POP; END; } f() { BEGIN; END; } Code to push and add "3" was eliminated because it had no side effects. Only function calls have side effects in sc. Page 56 PCCTS Advanced Tutorial 1.0x /* * Simple code generator for PCCTS 1.0x tutorial * * Terence Parr * Purdue University * Electrical Engineering * March 19, 1992 */ #include #include "sym.h" #include "charbuf.h" #define AST_FIELDS int token, level, offset; char str[D_TextSize]; #define zzcr_ast(ast,attr,tok,text) create_ast(ast,attr,tok,text) #include "antlr.h" #include "ast.h" #include "tokens.h" #include "dlgdef.h" #include "mode.h" #define GLOBAL 0 #define PARAMETER 1 #define LOCAL 2 static int LabelNum=0, GenHeader=0; #define DONT_CARE 0 #define SIDE_EFFECTS 1 #define VALUE 2 gen(t,emode) AST *t; int emode; /* evaluation mode member { SIDE_EFFECTS, VALUE, DONT_CARE } */ { AST *func, *arg, *locals, *slist, *a, *opnd1, *opnd2, *lhs, *rhs; int n; if ( t == NULL ) return; if ( !GenHeader ) { printf("#include \"sc.h\"\n"); GenHeader=1; } switch ( t->token ) { Page 57 PCCTS Advanced Tutorial 1.0x case DefineFunc : func = zzchild(t); arg = zzchild( zzsibling(func) ); locals = zzchild( zzsibling( zzsibling(func) ) ); slist = zzsibling( zzsibling( zzsibling(func) ) ); if ( strcmp(func->str, "main") == 0 ) printf("_%s()\n{\n\tBEGIN;\n", func->str); else printf("%s()\n{\n\tBEGIN;\n", func->str); for (a=locals; a!=NULL; a = zzsibling(a)) printf("\tLOCAL;\n"); gen( slist, DONT_CARE ); printf("\tEND;\n}\n"); break; case SLIST : for (a=zzchild(t); a!=NULL; a = zzsibling(a)) { gen( a, EvalMode(a->token) ); if ( a->token == Assign ) printf("\tPOP;\n"); } break; case DefineVar : printf("SCVAR %s;\n", zzchild(t)->str); break; Page 58 PCCTS Advanced Tutorial 1.0x case Mul : case Div : case Add : case Sub : case Equal : case NotEqual : opnd1 = zzchild(t); opnd2 = zzsibling( opnd1 ); gen( opnd1, emode ); gen( opnd2, emode ); if ( emode == SIDE_EFFECTS ) break; switch ( t->token ) { case Mul : printf("\tMUL;\n"); break; case Div : printf("\tDIV;\n"); break; case Add : printf("\tADD;\n"); break; case Sub : if ( opnd2==NULL ) printf("\tNEG;\n"); else printf("\tSUB;\n"); break; case Equal : printf("\tEQ;\n"); break; case NotEqual : printf("\tNEQ;\n"); break; } break; case If : a = zzchild(t); gen( a, VALUE ); /* gen code for expr */ n = LabelNum++; printf("\tBRF(iflabel%d);\n", n); a = zzsibling(a); gen( a, EvalMode(a->token) ); /* gen code for statement */ printf("iflabel%d: ;\n", n); break; case While : a = zzchild(t); n = LabelNum++; printf("wbegin%d: ;\n", n); gen( a, VALUE ); /* gen code for expr */ printf("\tBRF(wend%d);\n", n); a = zzsibling(a); gen( a, EvalMode(a->token) ); /* gen code for statement */ printf("\tBR(wbegin%d);\n", n); printf("wend%d: ;\n", n); break; Page 59 PCCTS Advanced Tutorial 1.0x case Return : gen( zzchild(t), VALUE ); printf("\tRETURN;\n"); break; case Print : gen( zzchild(t), VALUE ); printf("\tPRINT;\n"); break; case Assign : lhs = zzchild(t); rhs = zzsibling( lhs ); gen( rhs, emode ); printf("\tDUP;\n"); if ( lhs->level == GLOBAL ) printf("\tSTORE(%s);\n", lhs->str); else printf("\tLSTORE(%d);\n", lhs->offset); break; case VAR : if ( emode == SIDE_EFFECTS ) break; if ( t->level == GLOBAL ) printf("\tPUSH(%s);\n", t->str); else printf("\tLPUSH(%d);\n", t->offset); break; case FUNC : gen( zzchild(t), VALUE ); printf("\tCALL(%s);\n", t->str); break; case STRING : if ( emode == SIDE_EFFECTS ) break; printf("\tSPUSH(%s);\n", t->str); break; Page 60 PCCTS Advanced Tutorial 1.0x default : printf("Don't know how to handle: %s\n", zztokens[t->token]); break; } } EvalMode( tok ) int tok; { if ( tok == Assign || tok == If || tok == While || tok == Print || tok == Return ) return VALUE; else return SIDE_EFFECTS; } 5.3. Makefile for code generator This makefile is roughly the same as before except that we need to tell PCCTS to generate tree information and we need to link in the code generator. CFLAGS= -I../h GRM=tut4 SRC=scan.c $(GRM).c err.c sym.c code.c OBJ=scan.o $(GRM).o err.o sym.o code.o tutorial: $(OBJ) $(SRC) cc -o $(GRM) $(OBJ) $(GRM).c parser.dlg : $(GRM).g antlr -k 2 -gt $(GRM).g scan.c : parser.dlg dlg -C2 parser.dlg scan.c Page 61 Table of Contents 1. sc Language definition ....................................... 2 1.1. sc Expressions .................................... 2 1.2. sc Functions and statements ....................... 2 1.3. Example ........................................... 3 2. Language recognition ......................................... 3 2.1. Lexical analysis .................................. 4 2.2. Attributes ........................................ 4 2.3. Grammatical analysis .............................. 5 2.3.1. Definitions, variables ................. 5 2.3.2. Functions .............................. 5 2.3.3. Statements ............................. 6 2.3.4. Expressions ............................ 6 2.4. Complete ANTLR description (w/o symbol table management) ..................................................... 8 2.5. Makefile .......................................... 10 2.6. Testing ........................................... 10 2.7. Symbol table management ........................... 10 2.8. Complete ANTLR description (with symbol table management) ..................................................... 15 2.9. File sym.h ........................................ 18 2.10. Makefile (for use with symbol table management) 2.11. Sample input/output .............................. 20 3. Translate sc to stack code ................................... 21 3.1. A simple stack machine for sc ..................... 21 3.1.1. Functions .............................. 22 3.1.2. Variables .............................. 23 3.1.3. Expressions ............................ 24 3.1.4. Instructions ........................... 26 3.2. Examples .......................................... 27 3.3. Augmenting grammar to dump stack code ............. 28 3.4. Full translator ................................... 30 3.5. Makefile for Full Translator ...................... 35 3.6. Use of translator ................................. 35 3.7. Stack code macros - sc.h .......................... 36 4. Intermediate form construction ............................... 40 4.1. Tree Construction ................................. 40 4.2. Full Grammar to Construct Trees ................... 47 4.3. Example translations .............................. 53 5. Code generation from intermediate form ....................... 55 5.1. Modification of tree construction grammar ......... 55 5.2. Code generator .................................... 55 5.3. Makefile for code generator ....................... 61 Page 1