The Character-by-Character Challenge
A search engine index over a billion web pages, or an IDE serving autocomplete to a developer, must find all strings sharing a given prefix in milliseconds. A sorted array of strings with binary search costs O(m log n) per query, where m is the query length, and keeping the array sorted makes every insertion and deletion O(n). A hash table gives O(1) exact lookup but has no concept of prefix — hashing 'car' gives no information about 'card' or 'care'. The mismatch between character-level structure and key-value lookup is what motivates trie-based approaches, which align data structure shape with the shape of the problem.
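To make the sorted-array baseline concrete, here is a minimal Python sketch of prefix search with binary search: `bisect_left` locates the first candidate in O(m log n) comparisons, after which matches are contiguous. The word list is illustrative, not from any real index.

```python
from bisect import bisect_left

def prefix_matches(sorted_words, prefix):
    """Return all words in a sorted list that start with `prefix`.

    bisect_left finds the first possible match in O(m log n) string
    comparisons; because the list is sorted, all matches are
    contiguous, so a forward scan collects them.
    """
    i = bisect_left(sorted_words, prefix)
    matches = []
    while i < len(sorted_words) and sorted_words[i].startswith(prefix):
        matches.append(sorted_words[i])
        i += 1
    return matches

words = sorted(["car", "card", "care", "cat", "dog"])
print(prefix_matches(words, "car"))  # ['car', 'card', 'care']
```

The query itself is serviceable; the pain point is that every insertion into the sorted array shifts O(n) elements, which is untenable for a live index.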
Tries: Making Prefix Search First-Class
A trie (prefix tree) stores strings by routing each character through a branching node, so that all strings sharing a prefix share a path from the root. Prefix lookup is O(m) in the length of the query, independent of the number of stored strings, and enumerating all completions is a DFS from the matching node. Elasticsearch and Lucene store their term dictionary as a finite state transducer (FST), a compressed trie-like automaton that shares common suffixes as well as prefixes, achieving both fast prefix lookup and compact memory usage. Radix trees (Patricia tries) collapse chains of single-child nodes into single edges, reducing memory by 50–80% for typical word lists. Google's search autocomplete, IDE symbol search (VS Code, IntelliJ), and URL routing in web frameworks all use trie variants.
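A minimal uncompressed trie in Python makes the two operations above concrete: walking the prefix in O(m), then a DFS from the matching node to enumerate completions. This is a teaching sketch, not the compressed radix/FST form production systems use.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def completions(self, prefix):
        # Walk the prefix in O(m), independent of how many words are stored.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # DFS from the matching node enumerates every completion.
        results, stack = [], [(node, prefix)]
        while stack:
            n, s = stack.pop()
            if n.is_word:
                results.append(s)
            for ch, child in n.children.items():
                stack.append((child, s + ch))
        return sorted(results)

t = Trie()
for w in ["car", "card", "care", "cat"]:
    t.insert(w)
print(t.completions("car"))  # ['car', 'card', 'care']
```

A radix tree would store the same words with far fewer nodes by merging single-child chains (e.g. the edge "car" as one string instead of three nodes), which is where the memory savings cited above come from.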
Symbol Tables and the Hash Map
A compiler's symbol table maps identifiers (variable names, function names, type names) to their metadata: type, scope, memory location, and attributes. Hash maps are the universal choice because identifier lookup happens on nearly every line of compiled code, and O(1) average-case performance is essential for compilation speed. Scoping is implemented by chaining hash maps — each scope level gets its own map, and lookup walks the chain from innermost to outermost scope until the identifier is found or the chain is exhausted. Clang, GCC, and the Go compiler all use hash-table-based symbol tables internally. The hash function must distribute identifiers well; poor distribution produces long collision chains, degrading average lookup toward O(n) and slowing compilation of large files with many identifiers.
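The chained-scope pattern can be sketched in a few lines of Python. This is a simplified model, not any particular compiler's implementation; real symbol tables carry richer metadata per entry.

```python
class SymbolTable:
    """Scoped symbol table: a stack of hash maps, one per scope level."""

    def __init__(self):
        self.scopes = [{}]  # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, info):
        # Bindings always go into the innermost (current) scope.
        self.scopes[-1][name] = info

    def lookup(self, name):
        # Walk the chain from innermost to outermost scope.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None  # identifier not declared in any enclosing scope

st = SymbolTable()
st.define("x", "int")
st.enter_scope()
st.define("x", "float")     # shadows the outer x
print(st.lookup("x"))       # float
st.exit_scope()
print(st.lookup("x"))       # int
```

Shadowing falls out naturally: the innermost binding wins because lookup stops at the first scope that contains the name.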
Lexers, Parsers, and Syntax Trees
A compiler's first two passes — the lexer and parser — transform raw source text into a tree. The lexer uses a deterministic finite automaton (DFA), which is a graph of states, to tokenize the input in a single linear pass. The parser then uses the token stream to build an Abstract Syntax Tree (AST), where each node represents a syntactic construct and children represent sub-expressions or statements. Recursive descent parsers construct this tree by direct recursion over grammar rules, while LR parsers use an explicit stack and a state machine table generated from the grammar. Every semantic analysis pass — type checking, borrow checking in Rust, null-safety analysis in Kotlin — is a tree traversal over the AST.
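A toy recursive descent parser shows how grammar rules map one-to-one onto functions that build the AST. The grammar here (expressions with `+`, `*`, and parentheses) and the tuple-based AST encoding are illustrative choices, not from any real compiler; the regex tokenizer stands in for a generated DFA.

```python
import re

def tokenize(src):
    # One linear pass over the input, as a DFA-based lexer would make.
    return re.findall(r"\d+|[()+*]", src)

class Parser:
    """Recursive descent for the toy grammar:
         expr   -> term ('+' term)*
         term   -> factor ('*' factor)*
         factor -> NUM | '(' expr ')'
       Each method mirrors exactly one grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() == "+":
            self.eat()
            node = ("+", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == "*":
            self.eat()
            node = ("*", node, self.factor())
        return node

    def factor(self):
        tok = self.eat()
        if tok == "(":
            node = self.expr()
            self.eat()  # consume ')'
            return node
        return ("num", int(tok))

ast = Parser(tokenize("1 + 2 * (3 + 4)")).expr()
# ast == ('+', ('num', 1), ('*', ('num', 2), ('+', ('num', 3), ('num', 4))))
```

Note how operator precedence is encoded structurally: `term` sits below `expr`, so `*` binds tighter than `+` without any explicit precedence table. A semantic analysis pass would then be a recursive traversal over these tuples.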
From Source Code to Machine: The Full Pipeline
After parsing, a compiler transforms the AST into an intermediate representation (IR) — typically a control flow graph (CFG) where basic blocks are nodes and edges represent possible control transfers. Optimization passes like dead code elimination, constant folding, and loop invariant code motion are all graph algorithms on the CFG. Register allocation maps IR variables to physical registers and is classically modeled as graph coloring: two variables that are live at the same time interfere, each interference is an edge, and each color is a register. LLVM's IR and GCC's RTL are both CFG representations, and the dozens of optimization passes that run over them are direct applications of graph algorithms. The final instruction selection and scheduling phase also uses DAG-based algorithms to order instructions for maximum pipeline utilization.
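The graph-coloring view of register allocation can be sketched with a greedy coloring heuristic. The interference graph below (variables `a` through `d` with invented liveness relations) is a made-up example, and production allocators (e.g. Chaitin-Briggs or linear scan) are considerably more sophisticated, but the core idea is the same: interfering variables must get different registers, and when no register is free, a variable spills to memory.

```python
def color_interference_graph(graph, num_registers):
    """Greedily color an interference graph.

    graph: dict mapping each variable to the set of variables it
    interferes with (live at the same time). Returns a map from
    variable to register index, or None for variables that spill.
    """
    allocation = {}
    # Color high-degree (most-constrained) nodes first, a common heuristic.
    for var in sorted(graph, key=lambda v: len(graph[v]), reverse=True):
        used = {allocation[n] for n in graph[var] if n in allocation}
        free = [r for r in range(num_registers) if r not in used]
        allocation[var] = free[0] if free else None  # None => spill to memory
    return allocation

# a, b, c are all live simultaneously; d only overlaps with c.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"c"},
}
print(color_interference_graph(interference, 3))  # all four fit in 3 registers
print(color_interference_graph(interference, 2))  # one variable must spill
```

With three registers every variable gets a color; with two, the mutually interfering trio `a`, `b`, `c` cannot all fit, so one spills — exactly the situation a compiler's spill-cost heuristics exist to resolve.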