A brief overview of compiler design.
In Summer 2015, I built my own parser, Sasstree, to power a code linter. It was a fascinating side project (and a refreshing change of pace from my day job writing a web application). And it helped me better understand a question I'd been wondering about: how do computers make sense of what we write?
A compiler is a computer program that translates computer programs - either turning source code into machine code (like compiling C++ code into a 64-bit Windows binary) or translating it into another language (like how the TypeScript and Babel compilers output JavaScript1).
And compilers can process more than just source code! Markdown compilers like Marked output HTML, and GraphQL and SQL use compilers to parse queries. Heck, JavaScript's JSON.parse is a compiler that parses JSON text into JavaScript values!
Compilers process code in stages. First, they need to make sense of the text they've been given by tokenizing and parsing it. This is tackled by the "frontend" of the compiler. After that, the "backend" of the compiler can make optimizations and generate the final result.
The frontend's job is to turn structured text into something that's easier for the computer to work with - an abstract syntax tree.
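Sketched as code, the whole pipeline looks something like this (a rough sketch; the function names are placeholders for the pieces described below, a few of which are sketched later in this post):

// A rough sketch of the whole pipeline. The helper functions are placeholders;
// a few of them are sketched later in this post.
function compile(source) {
  const tokens = tokenize(source);       // frontend: break the text into tokens
  const ast = parse(tokens);             // frontend: build an abstract syntax tree
  const optimizedAst = optimize(ast);    // backend: improve the tree
  return generate(optimizedAst);         // backend: write out the target language
}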
The tokenizer (sometimes called the lexer or scanner) is the first stage of the compiler, and the first component of the frontend. It breaks the source code into "words", called tokens.
Tokens are characters or strings that have particular meaning, like a number, a string, or a keyword (like var or function). Here's how we might tokenize a simple math equation:
// tokenize("2 + 3 = 5");
[
{type: 'Number', value: '2'},
{type: 'Operator', value: '+'},
{type: 'Number', value: '3'},
{type: 'EqualSign', value: '='},
{type: 'Number', value: '5'},
]
You could break strings into tokens using regular expressions, like using /[0-9]+/ to match a number token. However, regular expressions can be slow - and notoriously tricky! - so it's probably safer to use simple conditionals & loops.
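For the math equation from earlier, a tokenizer built that way might look something like this (a minimal sketch, not the code behind any real compiler):

// A minimal tokenizer for equations like "2 + 3 = 5", using only
// conditionals and loops.
function tokenize(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    const char = input[i];
    if (char === ' ') {
      i++;                                   // skip whitespace
    } else if (char >= '0' && char <= '9') {
      let value = '';
      while (i < input.length && input[i] >= '0' && input[i] <= '9') {
        value += input[i++];                 // consume every digit of the number
      }
      tokens.push({type: 'Number', value: value});
    } else if (char === '+' || char === '-') {
      tokens.push({type: 'Operator', value: char});
      i++;
    } else if (char === '=') {
      tokens.push({type: 'EqualSign', value: char});
      i++;
    } else {
      throw new Error(`Unexpected character '${char}'`);
    }
  }
  return tokens;
}

// tokenize("2 + 3 = 5") returns the token list shown above.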
Splitting the input text into "words", instead of leaving it as one unbroken stream of characters, might not look like much, but it makes the compiler's next job easier.
The parser takes the "words" created by the tokenizer and groups them into "sentences"2, based on the grammatical rules of the source language. This process is called syntactic analysis.
The result of this stage is an abstract syntax tree. This is a tree data structure that makes it easier to think about logical parts of the source code & how they relate to each other:
// parse([
//   {type: 'Number', value: '2'},
//   {type: 'Operator', value: '+'},
//   {type: 'Number', value: '3'},
//   {type: 'EqualSign', value: '='},
//   {type: 'Number', value: '5'},
// ]);
{
  type: 'Equation',
  leftSide: {
    type: 'Expression',
    operator: '+',
    leftOperand: {
      type: 'Number',
      value: '2'
    },
    rightOperand: {
      type: 'Number',
      value: '3'
    },
  },
  rightSide: {
    type: 'Number',
    value: '5'
  }
}
You can play with the syntax tree for my SCSS parser on the live demo!
The grammar tells the parser how tokens may (and may not) be combined with each other. If it finds something it wasn't expecting, this is where a "syntax error" might occur:
2 + + 3 = 5
// syntax error! unexpected token '+'. expected 'Number'.
Parsers can read tokens in a couple different ways, depending on the complexity of the language's grammar. The simplest option is a "top-down" parser.
Top-down, or recursive descent, parsers read the token stream from left to right. The parser uses functions to process each type of element that could exist in the grammar.
To parse math equations like the examples above, we'd write functions like parseEquation, parseExpression, and parseNumber. Each function is responsible for parsing that particular type of thing, and hands off parsing to the other functions when it makes sense.
A top-down parser always starts with the "outermost" grammar rule, like parseEquation. That function would then hand off to parseExpression, which in turn might call parseNumber.
Sometimes, the parser might also need to "peek" ahead to determine how to interpret a given token - like figuring out whether a Number token is a standalone number or the first operand of an expression.
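Putting those pieces together, a recursive descent parser for our little equation grammar might look something like this sketch (simplified, and certainly not the only way to write it):

// A sketch of a top-down parser for the equation grammar above.
// Each function parses one kind of node and hands off to the others.
function parse(tokens) {
  let position = 0;

  const peek = () => tokens[position];       // look ahead without consuming
  const next = () => tokens[position++];     // consume the current token

  function parseNumber() {
    const token = next();
    if (!token || token.type !== 'Number') {
      const found = token ? token.value : 'end of input';
      throw new Error(`syntax error! unexpected token '${found}'. expected 'Number'.`);
    }
    return {type: 'Number', value: token.value};
  }

  // A standalone number, or "number operator number".
  function parseExpression() {
    const left = parseNumber();
    if (peek() && peek().type === 'Operator') {    // peek ahead to decide
      const operator = next().value;
      const right = parseNumber();
      return {type: 'Expression', operator: operator, leftOperand: left, rightOperand: right};
    }
    return left;                                   // just a standalone number
  }

  // The outermost rule: an expression, an equal sign, and another expression.
  function parseEquation() {
    const leftSide = parseExpression();
    const equals = next();
    if (!equals || equals.type !== 'EqualSign') {
      throw new Error(`syntax error! expected '='.`);
    }
    const rightSide = parseExpression();
    return {type: 'Equation', leftSide: leftSide, rightSide: rightSide};
  }

  return parseEquation();
}

// Feeding in the tokens from earlier produces the syntax tree shown above.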
Grammars that can be parsed this way are called LL(1) grammars, because the tokens are read left-to-right, the parser always produces a leftmost derivation (a fancy way of saying it builds the tree from the top down), and it peeks at most one token ahead.
Compilers like gcc, golang, and V8 use handwritten recursive-descent parsers.
More complicated languages that can't be unambiguously parsed top-to-bottom might use a "bottom-up" parser, which starts with the bottom "leaf" nodes on the AST and builds "up" from there. These types of parsers are too complicated to write by hand, and are therefore generally created by tools like ANTLR or Bison (which powers the parsers for Ruby and PHP).
One notable advantage of LR parsers is that their grammars can include left recursion: by the time the parser builds a parent node, it has already parsed that node's children, so a rule that refers to itself on the left doesn't send the parser into an infinite loop.
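Here's an informal sketch of why left recursion trips up a top-down parser (the grammar notation and function name are just for illustration):

// A left-recursive grammar rule, written informally:
//
//   Expression → Expression '+' Number
//              | Number
//
// A naive top-down parser for this rule has to call itself before it has
// consumed a single token, so it recurses forever:
function parseLeftRecursiveExpression() {
  const left = parseLeftRecursiveExpression();   // infinite recursion!
  // ...never gets any further
}
// A bottom-up parser has no such problem: it parses the Numbers first, and
// only later reduces them into the Expression that contains them.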
Grammars that can be parsed this way are called LR grammars because they read the provided tokens left-to-right, but trace out the rightmost derivation in reverse (which is just a fancy way of saying they build the AST from the bottom up).
Now that the compiler understands the text it was given, it's time to do something with it! The backend's job is to optimize and transform the parsed code into the desired output format.
(The "backend" for my parser, Sasstree, was a code linter: it traversed the AST, ran style checks against each node, and printed any suggestions to the console.)
Optimization is the compiler's superpower. It allows humans to focus on writing readable source code, while the computer finds ways to optimize it for a machine. (That usually means "faster", but it could also mean using less memory or being more power-efficient).
Since the compiler has a flawless memory & a perfect understanding of how code is structured, it can make optimizations that would be too tedious for any programmer to bother with.
In many modern languages, like Rust and Swift, the compiler optimization step is handled by LLVM. LLVM is able to optimize different types of languages by using a common "intermediate representation" for the optimization step. The LLVM project also includes frontend and backend components for a slew of languages.3
There are also compilers that focus solely on optimization. For example, UglifyJS renames JavaScript variables to space-efficient (but completely illegible) names like n and p. It can also remove "dead" blocks of code that are never used. (And experimental tools like Prepack take this even further by actually evaluating code in the compiler when possible!)
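As a made-up illustration of the idea (not UglifyJS's actual output), a minifier might turn the "before" below into the "after":

// Before: readable source, including a branch that can never run.
function calculateTotal(price, quantity) {
  const DEBUG = false;
  if (DEBUG) {
    console.log('calculating total...');   // dead code: DEBUG is always false
  }
  return price * quantity;
}

// After (illustrative only): parameters renamed, dead branch removed,
// whitespace stripped.
function calculateTotal(n,p){return n*p}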
Of course, these tools still just create other (smaller and faster) JavaScript files. Web browsers have their own compilers that then parse, optimize, and execute that code.4
The code generator step takes this optimized version of the code & writes it into the compiler's target language. This might be instructions for a particular machine, another high-level language... or just running the code right then and there!
If the target is machine code, then this is where the compiler makes platform-specific optimizations (like picking processor registers to store variables in). This is how the same program can be compiled for different types of processors - like a 32-bit or 64-bit binary for Windows, or an Intel or PowerPC binary for macOS.
Other compilers output high-level code, like JavaScript. After optimizations are performed, the compiler traverses the syntax tree one last time. For each type of node, it runs a "print" or toString function to write it in the target language.
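For the little equation AST from earlier, that last pass might look something like this sketch:

// Walk the tree and print each node in the target language
// (here the "target language" is just the plain-text equation).
function generate(node) {
  switch (node.type) {
    case 'Equation':
      return `${generate(node.leftSide)} = ${generate(node.rightSide)}`;
    case 'Expression':
      return `${generate(node.leftOperand)} ${node.operator} ${generate(node.rightOperand)}`;
    case 'Number':
      return node.value;
    default:
      throw new Error(`Unknown node type '${node.type}'`);
  }
}

// Using the sketches above, generate(parse(tokenize("2 + 3 = 5"))) gives back "2 + 3 = 5".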
Some compilers aren't meant to output code at all! These compilers are called interpreters. They still walk the tree, but instead of printing machine code or another high-level language for each node, they "run" each node directly as they reach it.
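Here's a sketch of what that might look like for our equation AST - instead of printing each node like the generator above, it evaluates it (an illustration only, not how any real interpreter is written):

// Walk the same tree, but evaluate each node instead of printing it.
function evaluate(node) {
  switch (node.type) {
    case 'Equation':
      // "Run" an equation by checking whether both sides are equal.
      return evaluate(node.leftSide) === evaluate(node.rightSide);
    case 'Expression': {
      const left = evaluate(node.leftOperand);
      const right = evaluate(node.rightOperand);
      if (node.operator === '+') return left + right;
      if (node.operator === '-') return left - right;
      throw new Error(`Unknown operator '${node.operator}'`);
    }
    case 'Number':
      return Number(node.value);
    default:
      throw new Error(`Unknown node type '${node.type}'`);
  }
}

// evaluate(parse(tokenize("2 + 3 = 5"))) returns true, since 2 + 3 really is 5.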
It can be slow to re-interpret the same functions over and over, so many interpreted languages5 instead use just-in-time compilers. These languages are still run on-demand, but the compiler keeps track of how many times each function is called. If one gets popular, it's optimized & compiled to machine code for a performance boost.
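Here's a toy sketch of just the bookkeeping part of that idea - counting calls and flagging "hot" functions. Real JITs are vastly more sophisticated, and the threshold and function names here are made up:

// Count how many times a function is called, and swap in an "optimized"
// version once it gets hot. (A real JIT would generate machine code here;
// this sketch just pretends by reusing the same function.)
const HOT_THRESHOLD = 1000;

function withHotTracking(name, interpretedFn) {
  let callCount = 0;
  let compiledFn = null;                 // stand-in for a machine-code version
  return function (...args) {
    if (compiledFn) return compiledFn(...args);
    callCount++;
    if (callCount >= HOT_THRESHOLD) {
      console.log(`${name} is hot! compiling it...`);
      compiledFn = interpretedFn;        // pretend this is the compiled version
    }
    return interpretedFn(...args);
  };
}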
Early computers were programmed by writing binary. Not only was this dreadfully tedious, these instructions were specific to a particular computer and had to be rewritten from scratch if you ever wanted to run a program on a different machine. Ouch.
The first compiler was invented by Grace Hopper in 1952 as a way to make programming less tedious by using English words instead of 0s and 1s. It paved the way for the introduction of high-level programming languages, like COBOL, that are still used to this day.
Many of the ideas that form the basis for modern compilers, like LR grammars (invented by Donald Knuth) or just-in-time compilation (invented by John McCarthy for LISP), were also created in the 60s.
Compilers: Principles, Techniques, and Tools - colloquially known as "The Dragon Book" because of its absurd cover art. It's dense, but covers almost anything you could want to know about compilers.
Writing An Interpreter In Go - this book walks through building a fully tested parser & interpreter for a simple programming language in Go. (This is a really fun excuse to play with Go!)
jamiebuilds/the-super-tiny-compiler - a simple well-annotated compiler for a LISP-like language, written in JavaScript.
LL & LR Parsing Demystified - the best explanation of LL & LR parsing I've seen, with a clear explanation of the pros & cons of each
1. And some compilers output the exact same language they take in! Prettier takes in JavaScript, parses it to understand the code's structure, and outputs it with standard formatting rules. And jscodeshift takes in JavaScript code and applies codemods to edit it programmatically.
2. This is more than just a convenient analogy! Linguists use phrase structure rules to build syntax trees out of human language, just like compilers do for code. If we were writing a compiler for English, our tokenizer would tag words as nouns, verbs, and adjectives. Our parser would wire them together (for example, identifying the subject of a sentence or tying an adjective to the noun it describes).
3. You can even compile LLVM-based languages like C, C++, Swift, or Rust to WebAssembly (via emscripten) and run them in a web browser! Woah!
4. And surprisingly, web browsers actually start the optimization process in the parser! They "skim" code using a preparser, which does the bare minimum work needed to identify all the functions in a JavaScript file (but skips most of the contents). Then, the parser only has to fully parse each function if it's run later on!
5. Examples include Ruby's MJIT Compiler, the PHP 8 JIT, and WebKit's B3 JIT.