LECTURE 1
Author: Robin Cockett
Date: January 7, 2002
WHAT IS A COMPILER
(Chapter 1: Louden & ASU)
Compiler as a translator:
                            _______________
                           |               |
      source language ---->|   COMPILER    |----> target language
                           |_______________|
A compiler is a program which reads in a program in a source language and translates it into a program (often executable) in a target language. We might expect that the program produced will be a "good translation" in some measurable sense ...
An interpreter (officially) is not a compiler. An interpreter reads in a source program and produces the result of running or executing that program. This involves actually executing the program in some fashion! The technology required for compiling is also required to produce an interpreter, as usually an interpreter has an internal abstract machine code which it can execute and into which the source programs must be compiled. Many modern programming language systems blur the traditional distinction between interpreter and compiler as they support both!
Examples:
A compiler is usually just one step in a number of steps which lead to the production of "absolute machine code" ... however it is perhaps the biggest step!
         source code
             |
             v
      ______________
     |  preprocess  |
     |______________|
             |
         source code
             |
             v
      ______________
     |   compile    |
     |______________|
             |
       assembler code
             |
             v
      ______________
     |   assemble   |
     |______________|
             |
      relocatable object code
             |
             v
      ______________
     |     link     |
     |______________|
             |
             v
      absolute machine code
THE STEPS OF COMPILING
Overview of a compiler:
           String
             |
             v
      ________________
     |    Lexical     |
     |    Analysis    |     Syntax:  O(|C|)
     |   (Scanning)   |
     |________________|
             |
         Token list
             |
             v
      ________________
     |  Grammatical   |
     |    Analysis    |
     |   (Parsing)    |
     |________________|
             |
      Syntax Tree (AST)
             |
             v
      ________________      ____________     ___________
     |    Semantic    |    |   Symbol   |   |   Error   |
     |    Analysis    |----|   Table    |   |  Handler  |     Semantics:  O(|C| log |C|)
     |________________|    |____________|   |___________|
             |
      Intermediate representation (IR)
             |
             v
      ________________
     |     Optim-     |     Most of the problems here are NP-complete:
     |    ization     |     approximate algorithms are appropriate!
     |________________|
             |
           (IR)
             |
             v
      ________________
     |      Code      |
     |   Generat'n    |
     |________________|
             |
             v
         Assembler
Let us follow an expression through a compiler:
   position := initial + rate * 60
The first question one must ask is whether this is a valid expression in the given source language. For this we need
(a) A formal definition of the language
(b) An effective membership test
(c) A plan for how to handle failure (i.e. error recovery!)
Usually the syntax of a language is specified in two stages:
Group characters together to form lexemes, which are translated into tokens; the tokens are the terminals of the grammar used in the next step.
   Lexeme     |  Token
   -----------+-----------------------
   "127"      |  INT(000000001111111)
   "length"   |  ID("length")
   "+"        |  ADD
This can be efficiently implemented using deterministic finite automata (DFA). These are often specified using "patterns" which are essentially regular expressions (we will eventually be using LEX).
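As a rough illustration of what a scanner does with such patterns, here is a small hand-coded fragment in C. It is only a sketch: the token names follow the table above, the surrounding driver (input handling, token attributes) is assumed rather than taken from these notes, and a LEX-generated scanner would look quite different internally.

    #include <ctype.h>

    /* A minimal hand-coded scanner fragment for the three patterns in the
       table above.  The caller supplies the input pointer and places to
       store the token attributes. */
    enum token { INT, ID, ADD, DONE };

    enum token next_token(const char **p, long *ival, char *name)
    {
        while (isspace((unsigned char)**p)) (*p)++;      /* skip whitespace     */
        if (isdigit((unsigned char)**p)) {               /* pattern: [0-9]+     */
            *ival = 0;
            while (isdigit((unsigned char)**p))
                *ival = *ival * 10 + (*(*p)++ - '0');
            return INT;
        }
        if (isalpha((unsigned char)**p)) {               /* pattern: [a-zA-Z]+  */
            int i = 0;
            while (isalpha((unsigned char)**p)) name[i++] = *(*p)++;
            name[i] = '\0';
            return ID;
        }
        if (**p == '+') { (*p)++; return ADD; }
        return DONE;                                     /* end of input        */
    }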
The structure of the language is defined by a context-free grammar, and it is determined whether the stream of tokens belongs to the language generated by the grammar.
---- example grammar for mini language ----

prog         -> stmt

stmt         -> IF expr THEN stmt ELSE stmt
              | ID ASSIGN expr
              | PRINT expr
              | BEGIN stmtlist END

stmtlist     -> stmt morestmtlist
              |

morestmtlist -> SEMICOLON stmtlist
              |

expr         -> term moreterms

term         -> factor morefactors
              | SUB term

factor       -> LPAR expr RPAR
              | ID
              | NUM

morefactors  -> MUL factor morefactors
              | DIV factor morefactors
              |

moreterms    -> ADD term moreterms
              | SUB term moreterms
              |
Efficient parsers are built using pushdown automata (PDA) for LL(n) and LR(n) grammars; LR(n) parsers use shift-reduce parsing. A very simple parsing technique is "recursive descent parsing", which requires an LL(n) grammar.
Consider again the assignment
position := initial + rate * 60
Does this syntactically legal expression actually mean anything? How do we tell?
The main components of semantic checking are checking that variables have been declared and type checking expressions.
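As a minimal sketch of the first of these checks, a symbol table can be as simple as a list of declared names with their types; the linear list and the names below are purely illustrative (real tables are hashed).

    #include <stdio.h>
    #include <string.h>

    /* A toy symbol table: a declared-variable check plus a stored type. */
    enum type { T_INT, T_REAL };

    struct sym { const char *name; enum type ty; };

    static struct sym table[256];
    static int nsyms = 0;

    static void declare(const char *name, enum type ty)
    {
        table[nsyms].name = name;
        table[nsyms].ty   = ty;
        nsyms++;
    }

    /* Returns the entry for a name, or NULL: an "undeclared variable" error. */
    static struct sym *lookup(const char *name)
    {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
        return NULL;
    }

    int main(void)
    {
        declare("position", T_REAL);
        declare("initial",  T_REAL);
        declare("rate",     T_REAL);
        /* Type checking "initial + rate * 60": rate is T_REAL, so the
           integer literal 60 must be coerced (int2real) before the
           multiplication. */
        printf("%s declared? %s\n", "rate", lookup("rate") ? "yes" : "no");
        return 0;
    }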
At the end of this stage we know that we have a legal program and we generate an intermediate representation of the code. This representation may be sufficiently close to machine code that it facilitates the translation to machine code and yet sufficiently high level that it facilitates optimization. A typical intermediate code is "three address code". Each instruction in this code is a simple assignment consisting of an assigned variable, a binary or unary operation, and its one or two arguments. The above assignment might translate to:
   temp1 := int2real(60)
   temp2 := id3 * temp1
   temp3 := id2 + temp2
   id1   := temp3
Modern compilers often use higher level intermediate languages which have associated transformations that can be reduced to this form.
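One possible concrete representation of three address code, purely for illustration (the operator and field names are assumptions, not the course's data structure), is a list of "quadruple" records:

    #include <stddef.h>

    /* A sketch of three address code held as "quadruples": one operator,
       at most two arguments, and the assigned variable. */
    enum op { OP_ADD, OP_MUL, OP_INT2REAL, OP_COPY };

    struct quad {
        enum op     op;
        const char *arg1;     /* first (or only) argument             */
        const char *arg2;     /* second argument, NULL for unary ops  */
        const char *result;   /* the assigned variable                */
    };

    /* position := initial + rate * 60  as the four instructions above: */
    static struct quad code[] = {
        { OP_INT2REAL, "60",    NULL,    "temp1" },
        { OP_MUL,      "id3",   "temp1", "temp2" },
        { OP_ADD,      "id2",   "temp2", "temp3" },
        { OP_COPY,     "temp3", NULL,    "id1"   },
    };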
The intermediate representation is optimized using program transformation techniques. Applied to the above code, these may have the following effect:
   temp1 := id3 * 60.0
   id1   := id2 + temp1
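As a sketch of the flavour of such a transformation, the following C fragment substitutes the folded real constant for the int2real temporary, using the illustrative quadruple structure from the previous sketch. The simplifying assumptions (the argument is an integer literal, there is only one such instruction, a single shared buffer) are noted in the comments and would not hold in a real optimizer.

    #include <stdio.h>
    #include <string.h>

    /* Where a temporary is defined as int2real of an integer literal,
       substitute the real constant into later uses and drop the defining
       instruction.  Reuses the illustrative quad structure above. */
    enum op { OP_ADD, OP_MUL, OP_INT2REAL, OP_COPY, OP_NOP };
    struct quad { enum op op; const char *arg1, *arg2, *result; };

    static void fold_int2real(struct quad *code, int n)
    {
        static char buf[32];
        for (int i = 0; i < n; i++) {
            if (code[i].op != OP_INT2REAL) continue;
            snprintf(buf, sizeof buf, "%s.0", code[i].arg1);   /* "60" -> "60.0"  */
            for (int j = i + 1; j < n; j++) {                  /* substitute uses */
                if (code[j].arg1 && strcmp(code[j].arg1, code[i].result) == 0)
                    code[j].arg1 = buf;
                if (code[j].arg2 && strcmp(code[j].arg2, code[i].result) == 0)
                    code[j].arg2 = buf;
            }
            code[i].op = OP_NOP;    /* the defining instruction is now dead  */
            break;                  /* one shared buffer, so handle one case */
        }
    }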
Intermediate code is translated to a suitable assembler code. Here is the SPARC assembler code for this!
   ld      [%fp-16],%f4
   ld      [%fp-12],%f3
   sethi   %hi(.L_cseg2),%l0
   or      %l0,%lo(.L_cseg2),%l0
   ld      [%l0+0],%f2
   fmuls   %f3,%f2,%f2
   fadds   %f4,%f2,%f2
   st      %f2,[%fp-8]
To obtain this I actually wrote a short C program
   #include <stdio.h>

   int main()
   {
       float position, rate, initial;
       initial = 12.0;
       rate = 3;
       position = initial + rate * 60;
       printf("%f %f %f \n", position, initial, rate);
       return 0;
   }
and then compiled it with the -S option. This produces assembler alongside the code. If you want to see how statements should be translated into assembler, this reverse engineering technique is invaluable ... when you get to the code generation stage (next course CPSC510).
STEP 7: Assembler optimizations
In generating assembler there are a number of important issues:
QUALITIES OF A COMPILER ..