How Symbolic compiles itself
This document walks through sigil/runes/symc/src/main.sym — the Symbolic compiler
written in Symbolic — and the bootstrap that proves it correct. If you have
read the Language Guide, you already know every construct
it uses.
text on stdin                                        binary on stdout
--------------> lex --> parse + IR --> codegen --> ELF / PE / wasm …
                 |          |             |
              tokens  IR instructions   raw machine code +
                      (VReg/label IR)   format headers
The compiler is five runes concatenated by build-symc.sh:
sigil/runes/std/src/main.sym — standard library (Vec, HashMap, …)
sigil/runes/lex/src/main.sym — tokenizer (longest-match)
sigil/runes/parse/src/main.sym — Pratt parser, builds AST
sigil/runes/ir/src/main.sym — semantic analysis + IR lowering
sigil/runes/back/src/main.sym — thirteen-target code generator + object format writers
The assembled single-file build lives in sigil/runes/symc/src/main.sym (~462 KB source,
~510 KB compiled binary). No feature outside the Guide is used — the language is
expressive enough to describe its own compiler.
1. The bootstrap
A self-hosting compiler must reproduce itself exactly. install.sh drives a
four-stage chain and asserts a byte-identical fixpoint:
symc0    = cargo build             (Rust seed)
symc.lcc = symc0(runes:lcc)        x64-only, no argv needed to compile it
symc.s1  = symc.lcc(runes:lmain)   multi-target driver compiled for the first time
symc.s2  = symc.s1(runes:lmain)    compiler compiled by itself
symc.s3  = symc.s2(runes:lmain)    ...and once more
assert symc.s2 == symc.s3          byte-identical ⇒ fixpoint (~505 KB)
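symc.s2 and symc.s3 are both runes:lmain compiled by a rune-derived
compiler (symc.s1 and symc.s2 respectively), and both of those compilers are
themselves built from runes:lmain. If every earlier stage compiled correctly and
codegen is deterministic, the two compilers behave identically, so their
outputs must be byte-identical. They are.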
Once the fixpoint holds, the Rust seed is no longer needed: Symbolic builds Symbolic.
The stage0/tests/bootstrap.rs test reproduces this fixpoint on every cargo test run (gated to x86_64-linux).
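The final assertion in the chain is a plain byte comparison. A minimal sketch of that check in Python, assuming the two stage outputs have already been written to disk; the function names first_difference and assert_fixpoint are illustrative and do not come from install.sh or bootstrap.rs:

```python
from itertools import zip_longest
from pathlib import Path


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if equal."""
    # zip_longest pads the shorter input with None, so a length mismatch
    # is reported at the offset where the shorter input ends.
    for offset, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return offset
    return None


def assert_fixpoint(stage2: Path, stage3: Path) -> None:
    """Raise AssertionError unless both files are byte-identical."""
    diff = first_difference(stage2.read_bytes(), stage3.read_bytes())
    assert diff is None, f"{stage2} and {stage3} differ at byte {diff}"
```

Reporting the first differing offset, as cmp does, is usually more useful than a bare yes/no when a nondeterminism bug breaks the fixpoint.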
2. Data model
All compiler state lives in hash cells (Guide §9), so functions share it
without passing it around. Key cells in sigil/runes/ir/src/main.sym:
  #src #slen #pos         the source bytes, their length, and the lexer cursor
  #tk #tval #toff #tlen   the current token: kind, integer value, name span
  #insn #nins             IR instruction buffer (VReg ops)
  #data #dlen             the data segment (cell initialisers, then strings)
  #cells #ncell           table of declared cells (name span -> data offset)
  #fn #nfn                table of functions (name span -> IR range)
  #loc #nloc              the current function's locals (name span -> VReg)
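Several of these tables are keyed by a name span rather than by a string. A rough sketch of that idea in Python, assuming a span is an (offset, length) pair into the source bytes and that lookup is a linear scan comparing the underlying bytes; the class SpanTable and its methods are hypothetical, and the real layout of #cells, #fn and #loc may differ:

```python
class SpanTable:
    """Maps a name span (offset, length) in `src` to an integer value."""

    def __init__(self, src: bytes):
        self.src = src
        self.entries = []  # list of (offset, length, value) triples

    def _name(self, off: int, length: int) -> bytes:
        # The name is never copied out of the source; it is sliced on demand.
        return self.src[off:off + length]

    def insert(self, off: int, length: int, value: int) -> None:
        self.entries.append((off, length, value))

    def lookup(self, off: int, length: int):
        """Return the value of the entry whose name bytes match the span."""
        wanted = self._name(off, length)
        for e_off, e_len, value in self.entries:
            if e_len == length and self._name(e_off, e_len) == wanted:
                return value
        return None
```

Storing spans instead of strings means a declaration and a later use of the same name match by comparing source bytes, with no separate string storage.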
The backend (sigil/runes/back/src/main.sym) allocates physical registers with a
linear-scan allocator, encodes instructions to raw bytes per target, then writes
the object format (ELF / Mach-O / PE / wasm / SPIR-V).
3. The lexer (sigil/runes/lex/src/main.sym)
:next advances #pos past whitespace and ::: comments, then classifies the
next token into a discriminant stored in #tk (value in #tval, source span in
#toff/#tlen). It implements longest-match greedy scanning, as in the - family
of operators:
  ---  -> modulo      --=  -> less-than        --<  -> rotate-left
  --   -> divide      -=   -> less-or-equal    -<   -> shift-left
  -?   -> else        -&+  -> xor              -&   -> bitwise-not
  -    -> subtract
The lexer reads bytes straight from #src with :ld8, comparing against ASCII
codes — no string library required. Token discriminants are validated by
sigil/runes/lex/test.sh against symc0 --dump-tokens on the full corpus.
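Longest-match scanning over the - family can be sketched in a few lines of Python. The operator spellings and meanings are copied from the table in this section; the function name match_minus and the use of bytes.startswith (instead of byte-by-byte :ld8 comparisons) are simplifications for illustration only:

```python
# Spellings and meanings taken from the - family table in this section.
MINUS_FAMILY = {
    "---": "modulo",
    "--=": "less-than",
    "--<": "rotate-left",
    "--": "divide",
    "-=": "less-or-equal",
    "-<": "shift-left",
    "-?": "else",
    "-&+": "xor",
    "-&": "bitwise-not",
    "-": "subtract",
}

# Try longer spellings first so "---" is never read as "--" followed by "-".
_BY_LENGTH = sorted(MINUS_FAMILY, key=len, reverse=True)


def match_minus(src: bytes, pos: int):
    """Return (meaning, new_pos) for the longest operator at pos, or None."""
    for op in _BY_LENGTH:
        if src.startswith(op.encode("ascii"), pos):
            return MINUS_FAMILY[op], pos + len(op)
    return None
```

Ordering the candidates by length is what makes the scan greedy: a shorter operator is considered only after every longer spelling with the same prefix has failed to match.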
4. Parser + IR lowering (sigil/runes/parse/src/main.sym + sigil/runes/ir/src/main.sym)
The parser is a Pratt parser: :cnud handles a primary (literal, register,
cell, call, parenthesised group, unary), then :cexp consumes following infix
operators at sufficient binding power. Binding powers reproduce the reference
compiler's precedence table, so 2 + 3 ++ 4 parses as 2 + (3 ++ 4).
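The nud/infix split can be illustrated with a small Pratt parser in Python. This is a sketch under stated assumptions: the method names cnud and cexp mirror :cnud and :cexp, but the binding-power values 10 and 20 are invented (only "++ binds tighter than +" comes from the text), operators are assumed left-associative, and tokens are assumed to be separated by whitespace:

```python
# Hypothetical binding powers; only their relative order comes from the text.
BINDING_POWER = {"+": 10, "++": 20}


class Parser:
    def __init__(self, text: str):
        self.tokens = text.split()  # whitespace-separated, for brevity
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def cnud(self):
        """Parse a primary: an integer literal or a parenthesised group."""
        tok = self.advance()
        if tok == "(":
            inner = self.cexp(0)
            assert self.advance() == ")", "expected ')'"
            return inner
        return int(tok)

    def cexp(self, min_bp: int):
        """Parse a primary, then fold in infix operators above min_bp."""
        left = self.cnud()
        while True:
            op = self.peek()
            bp = BINDING_POWER.get(op)
            if bp is None or bp <= min_bp:
                return left
            self.advance()
            # Passing bp itself makes equal-power operators left-associative.
            right = self.cexp(bp)
            left = (op, left, right)
```

With these assumed powers, Parser("2 + 3 ++ 4").cexp(0) returns ("+", 2, ("++", 3, 4)), matching the grouping described above.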
The IR lowering in sigil/runes/ir/src/main.sym is syntax-directed: it reuses the
parser's expression cursor and lowers directly to VReg/label IR as it parses,
matching symc0's allocation order exactly. This is validated construct-by-
construct against symc0 --dump-ir on 146 test programs by sigil/runes/ir/test.sh.
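Matching another compiler's IR dump instruction for instruction only works if VRegs are handed out in a fixed order. The Python sketch below shows one way that order can fall out of evaluation order. It is illustrative only: it walks a pre-built tuple instead of lowering during parsing, and the "const" opcode and tuple encoding are assumptions, not the real IR:

```python
def lower(expr, insns):
    """Append IR for `expr` to `insns`; return the VReg holding its value.

    Every instruction defines exactly one VReg, numbered by its position
    in `insns`, so allocation order is fixed by evaluation order.
    """
    if isinstance(expr, int):
        dst = len(insns)
        insns.append(("const", dst, expr))
        return dst
    op, lhs, rhs = expr
    a = lower(lhs, insns)  # left operand is lowered first
    b = lower(rhs, insns)  # then the right operand
    dst = len(insns)
    insns.append((op, dst, a, b))
    return dst
```

Lowering ("+", 2, ("++", 3, 4)) emits three const instructions into VRegs 0 to 2, then the ++ into VReg 3 and the + into VReg 4. Swapping the two recursive calls would produce an equally valid program with different VReg numbers, which is exactly the kind of divergence a dump comparison catches.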
5. Code generation (sigil/runes/back/src/main.sym)
The backend:
- Runs a linear-scan register allocator over the VReg IR.
- Encodes each instruction to raw bytes for the selected target (#tgt0–12): x64 (Linux/macOS/Windows/UEFI/FreeBSD), AArch64 (Linux/macOS/iOS/Android), RISC-V 64, LoongArch64, WebAssembly, SPIR-V.
- Bakes in a tiny runtime (bump-pointer heap allocator, I/O, integer printing, integer power) so the output is a fully self-contained static binary: no libc, no linker.
- Writes the object format header (ELF / Mach-O / PE / wasm module) and resolves relocations.
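For readers unfamiliar with linear-scan allocation, here is a compact Python sketch of the textbook algorithm (in the style of Poletto and Sarkar). It is not taken from the backend: the interval representation, the "spill the interval that ends last" heuristic, and the function name linear_scan are all assumptions, and the real allocator may differ in every detail:

```python
def linear_scan(intervals, num_regs):
    """Assign registers to live intervals.

    intervals: dict mapping vreg -> (start, end), inclusive IR positions.
    num_regs:  number of physical registers, assumed to be at least 1.
    Returns a dict mapping vreg -> register index, or "spill".
    """
    assignment = {}
    free = list(range(num_regs))
    active = []  # (end, vreg) pairs that currently hold a register

    order = sorted(intervals.items(), key=lambda kv: (kv[1][0], kv[0]))
    for vreg, (start, end) in order:
        # Expire intervals that ended before this one starts.
        for old_end, old in sorted(active):
            if old_end >= start:
                break
            active.remove((old_end, old))
            free.append(assignment[old])

        if free:
            assignment[vreg] = free.pop(0)
            active.append((end, vreg))
            continue

        # No register is free: spill whichever interval ends last.
        last_end, victim = max(active)
        if last_end > end:
            assignment[vreg] = assignment[victim]
            assignment[victim] = "spill"
            active.remove((last_end, victim))
            active.append((end, vreg))
        else:
            assignment[vreg] = "spill"
    return assignment
```

The allocator visits intervals once in order of start position, which is what keeps it linear in practice and, given a fixed tie-break, deterministic. Determinism matters here because the bootstrap fixpoint depends on it.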
6. Reproduce it yourself
bash install.sh # builds the full bootstrap chain; asserts s2 == s3
source ~/.symbolic/env
symc < sigil/runes/symc/src/main.sym > /tmp/symc_repro && chmod +x /tmp/symc_repro
cmp /tmp/symc_repro ~/.symbolic/bin/symc && echo "fixpoint OK"
cargo test -p symc0 --test bootstrap # the same check as a Rust test
Then open sigil/runes/symc/src/main.sym and read it — it is, by
construction, a program you already know how to read.