Symbolic

How Symbolic compiles itself

This document walks through sigil/runes/symc/src/main.sym — the Symbolic compiler written in Symbolic — and the bootstrap that proves it correct. If you have read the Language Guide, you already know every construct it uses.

   text on stdin                                         binary on stdout
   -------------->  lex  -->  parse + IR  -->  codegen  -->  ELF / PE / wasm …
                    |              |               |
                 tokens      IR instructions   raw machine code +
                            (VReg/label IR)    format headers

The compiler is five runes concatenated by build-symc.sh:

sigil/runes/std/src/main.sym    — standard library (Vec, HashMap, …)
sigil/runes/lex/src/main.sym    — tokenizer (longest-match)
sigil/runes/parse/src/main.sym  — Pratt parser, builds AST
sigil/runes/ir/src/main.sym     — semantic analysis + IR lowering
sigil/runes/back/src/main.sym   — thirteen-target code generator + object format writers

The assembled single-file build lives in sigil/runes/symc/src/main.sym (~462 KB source, ~510 KB compiled binary). No feature outside the Guide is used — the language is expressive enough to describe its own compiler.


1. The bootstrap

A self-hosting compiler must reproduce itself exactly. install.sh drives a four-stage chain and asserts a byte-identical fixpoint:

   symc0     = cargo build (Rust seed)
   symc.lcc  = symc0(runes:lcc)        x64-only, no argv needed to compile it
   symc.s1   = symc.lcc(runes:lmain)   multi-target driver compiled for the first time
   symc.s2   = symc.s1(runes:lmain)    compiler compiled by itself
   symc.s3   = symc.s2(runes:lmain)    ...and once more
   assert symc.s2 == symc.s3           byte-identical ⇒ fixpoint (~505 KB)

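symc.s2 and symc.s3 are both runes:lmain compiled by a rune-derived compiler (symc.s1 and symc.s2 respectively). Those two compilers were built from the same source, so they behave identically even if their own bytes differ. If codegen is deterministic, their outputs must therefore be byte-identical. They are.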
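
The argument can be checked on a toy model. The Python sketch below is an illustration only and is not part of install.sh: the Binary class, the compile_with function and the string "images" are all hypothetical stand-ins. It assumes that a binary's bytes are shaped by the compiler that produced it, while its behaviour comes from the source it was compiled from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    image: str    # stand-in for the bytes on disk
    codegen: str  # the code-generation behaviour this binary has when run


def compile_with(compiler: Binary, source: str) -> Binary:
    # The output bytes depend on how the *compiler* generates code;
    # the output's own behaviour depends on the *source* it was given.
    image = f"{source} encoded by {compiler.codegen} codegen"
    return Binary(image=image, codegen=source)


symc0 = Binary(image="rust build", codegen="seed")
symc_lcc = compile_with(symc0, "lcc")
s1 = compile_with(symc_lcc, "lmain")  # bytes shaped by the lcc compiler
s2 = compile_with(s1, "lmain")        # bytes shaped by lmain itself
s3 = compile_with(s2, "lmain")        # same source, same behaviour, same bytes

print(s1.image == s2.image)  # False
print(s2.image == s3.image)  # True
```

In this model s1 may differ from s2 because it was produced by a different compiler, which is why the assertion compares s2 with s3 rather than s1 with s2.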

Once the fixpoint holds, the Rust seed is no longer needed: Symbolic builds Symbolic.

The stage0/tests/bootstrap.rs test reproduces this fixpoint on every cargo test run (gated to x86_64-linux).


2. Data model

All compiler state lives in hash cells (Guide §9), so functions share it without passing it around. Key cells in sigil/runes/ir/src/main.sym:

#src #slen #pos        the source bytes, their length, and the lexer cursor
#tk #tval #toff #tlen  the current token: kind, integer value, name span
#insn #nins            IR instruction buffer (VReg ops)
#data #dlen            the data segment (cell initialisers, then strings)
#cells #ncell          table of declared cells   (name span -> data offset)
#fn #nfn               table of functions        (name span -> IR range)
#loc #nloc             the current function's locals (name span -> VReg)
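
The "name span" columns above mean that names are kept as (offset, length) pairs pointing back into the source bytes rather than as separate strings. A minimal Python sketch of such a table follows. The SpanTable class, its linear lookup and the sample source are assumptions made for illustration; the sample is not Symbolic syntax, and the real tables live in hash cells.

```python
class SpanTable:
    """Maps a name span (offset, length) in the source to a value."""

    def __init__(self, src: bytes):
        self.src = src
        self.entries = []  # list of (offset, length, value)

    def _same_name(self, off_a, len_a, off_b, len_b):
        # Two spans name the same thing if the bytes they cover are equal.
        return (len_a == len_b and
                self.src[off_a:off_a + len_a] == self.src[off_b:off_b + len_b])

    def define(self, off, length, value):
        self.entries.append((off, length, value))

    def lookup(self, off, length):
        # Linear search: an assumption of this sketch, not a documented detail.
        for entry_off, entry_len, value in self.entries:
            if self._same_name(entry_off, entry_len, off, length):
                return value
        return None


src = b"count total count"      # three names; the third repeats the first
cells = SpanTable(src)
cells.define(0, 5, 0)           # "count" -> data offset 0
cells.define(6, 5, 8)           # "total" -> data offset 8

print(cells.lookup(12, 5))      # second occurrence of "count" -> 0
print(cells.lookup(6, 5))       # "total" -> 8
```

Storing spans avoids copying names out of the source buffer, which fits a compiler that works without a string library.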

The backend (sigil/runes/back/src/main.sym) allocates physical registers with a linear-scan allocator, encodes instructions to raw bytes per target, then writes the object format (ELF / Mach-O / PE / wasm / SPIR-V).


3. The lexer (sigil/runes/lex/src/main.sym)

:next advances #pos past whitespace and ::: comments, then classifies the next token into a discriminant stored in #tk (value in #tval, source span in #toff/#tlen). It implements longest-match greedy scanning — for example the - family:

   --- -> modulo      --= -> less-than      --< -> rotate-left
   --  -> divide      -=  -> less-or-equal   -<  -> shift-left
   -?  -> else        -&+ -> xor             -&  -> bitwise-not
   -   -> subtract

The lexer reads bytes straight from #src with :ld8, comparing against ASCII codes — no string library required. Token discriminants are validated by sigil/runes/lex/test.sh against symc0 --dump-tokens on the full corpus.
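
Longest-match scanning for the - family can be sketched in a few lines of Python. This is a table-driven illustration, not the byte-by-byte comparison the Symbolic lexer performs; the function name lex_minus_family is hypothetical, and the sketch skips whitespace only (no ::: comments, no other token kinds).

```python
# Spellings and meanings taken from the table above.
OPERATORS = {
    "---": "modulo",  "--=": "less-than",     "--<": "rotate-left",
    "--":  "divide",  "-=":  "less-or-equal", "-<":  "shift-left",
    "-?":  "else",    "-&+": "xor",           "-&":  "bitwise-not",
    "-":   "subtract",
}

# Try longer spellings first so that "---" is never read as "--" then "-".
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def lex_minus_family(text: str) -> list:
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for spelling in BY_LENGTH:
            if text.startswith(spelling, pos):
                tokens.append(OPERATORS[spelling])
                pos += len(spelling)
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
    return tokens


print(lex_minus_family("----"))       # ['modulo', 'subtract']
print(lex_minus_family("-&+ -& -?"))  # ['xor', 'bitwise-not', 'else']
```

The first call shows the greedy behaviour: four dashes become the longest operator available followed by whatever remains.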


4. Parser + IR lowering (sigil/runes/parse/src/main.sym + sigil/runes/ir/src/main.sym)

The parser is a Pratt parser: :cnud handles a primary (literal, register, cell, call, parenthesised group, unary), then :cexp consumes following infix operators at sufficient binding power. Binding powers reproduce the reference compiler's precedence table, so 2 + 3 ++ 4 parses as 2 + (3 ++ 4).
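
The binding-power loop can be sketched in Python as follows. The names nud and expr, the numeric binding powers and the tuple output are assumptions made for illustration; only the relative order (++ binds tighter than +) comes from the example above, and the sketch handles integers, parentheses and these two operators only.

```python
# Assumed values; only their relative order matters here.
BINDING_POWER = {"+": 10, "++": 20}


def parse(tokens):
    pos = 0

    def nud():
        # A primary: an integer literal or a parenthesised group.
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expr(0)
            pos += 1  # skip the closing ")"
            return inner
        return int(tok)

    def expr(min_bp):
        # Parse a primary, then keep consuming infix operators that bind
        # more tightly than the context we were called from.
        nonlocal pos
        left = nud()
        while pos < len(tokens):
            op = tokens[pos]
            bp = BINDING_POWER.get(op)
            if bp is None or bp <= min_bp:
                break
            pos += 1
            right = expr(bp)
            left = (op, left, right)
        return left

    return expr(0)


print(parse("2 + 3 ++ 4".split()))      # ('+', 2, ('++', 3, 4))
print(parse("( 2 + 3 ) ++ 4".split()))  # ('++', ('+', 2, 3), 4)
```

Because the recursive call passes the operator's own binding power and the loop stops on an equal value, operators of the same power group to the left in this sketch.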

The IR lowering in sigil/runes/ir/src/main.sym is syntax-directed: it reuses the parser's expression cursor and lowers directly to VReg/label IR as it parses, matching symc0's allocation order exactly. This is validated construct-by-construct against symc0 --dump-ir on 146 test programs by sigil/runes/ir/test.sh.
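
"Lowers as it parses" means no tree is kept: each construct emits instructions the moment it is recognised and hands back the virtual register holding its result. A minimal Python sketch of the idea, limited to integer literals and two infix operators, is shown here. The instruction tuples, the "const" opcode, the binding-power values and the VReg numbering order are all assumptions of this sketch and are not claimed to match symc0.

```python
BP = {"+": 10, "++": 20}  # assumed binding powers


def lower(tokens):
    insns = []     # stand-in for the #insn buffer
    pos = 0
    next_vreg = 0

    def fresh():
        nonlocal next_vreg
        v = next_vreg
        next_vreg += 1
        return v

    def primary():
        # Emit a load-constant and return the VReg that holds it.
        nonlocal pos
        v = fresh()
        insns.append(("const", v, int(tokens[pos])))
        pos += 1
        return v

    def expr(min_bp):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and BP.get(tokens[pos], 0) > min_bp:
            op = tokens[pos]
            pos += 1
            right = expr(BP[op])
            dst = fresh()
            insns.append((op, dst, left, right))  # dst = left op right
            left = dst
        return left

    result = expr(0)
    return insns, result


insns, result = lower("2 + 3 ++ 4".split())
for insn in insns:
    print(insn)
# ('const', 0, 2)
# ('const', 1, 3)
# ('const', 2, 4)
# ('++', 3, 1, 2)
# ('+', 4, 0, 3)
```

Since VReg numbers are handed out in parse order, any two implementations that parse the same way and allocate at the same points produce the same numbering, which is what makes a dump-for-dump comparison practical.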


5. Code generation (sigil/runes/back/src/main.sym)

The backend:

  1. Runs a linear-scan register allocator over the VReg IR.
  2. Encodes each instruction to raw bytes for the selected target (#tgt 0–12): x64 (Linux/macOS/Windows/UEFI/FreeBSD), AArch64 (Linux/macOS/iOS/Android), RISC-V 64, LoongArch64, WebAssembly, SPIR-V.
  3. Bakes in a tiny runtime (bump-pointer heap allocator, I/O, integer printing, integer power) so the output is a fully self-contained static binary — no libc, no linker.
  4. Writes the object format header (ELF / Mach-O / PE / wasm module) and resolves relocations.
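
Step 1 can be illustrated with a textbook linear-scan allocator. The Python sketch below is a generic version of the technique, assuming live intervals have already been computed. The function name, the interval format and the "spill the interval that ends last" policy are assumptions, not a description of what the Symbolic backend does.

```python
def linear_scan(intervals, num_regs):
    """intervals: {vreg: (start, end)}. Returns {vreg: register index or 'spill'}."""
    free = list(range(num_regs))
    active = []      # (end, vreg) pairs that currently hold a register
    assignment = {}

    for vreg, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        still_live = []
        for live_end, live_vreg in active:
            if live_end < start:
                free.append(assignment[live_vreg])
            else:
                still_live.append((live_end, live_vreg))
        active = still_live

        if free:
            assignment[vreg] = free.pop(0)
            active.append((end, vreg))
            continue

        # No register left: spill whichever interval ends last.
        active.sort()
        last_end, victim = active[-1]
        if last_end > end:
            assignment[vreg] = assignment[victim]  # take over the victim's register
            assignment[victim] = "spill"
            active[-1] = (end, vreg)
        else:
            assignment[vreg] = "spill"

    return assignment


print(linear_scan({0: (0, 4), 1: (1, 2), 2: (3, 5)}, num_regs=2))
# {0: 0, 1: 1, 2: 1}
print(linear_scan({0: (0, 10), 1: (1, 3), 2: (2, 4)}, num_regs=2))
# {0: 'spill', 1: 1, 2: 0}
```

In the first call VReg 2 reuses register 1 once VReg 1 has expired; in the second, all three intervals overlap, so the longest-lived one is spilled.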


6. Reproduce it yourself

bash install.sh              # builds the full bootstrap chain; asserts s2 == s3
source ~/.symbolic/env
symc < sigil/runes/symc/src/main.sym > /tmp/symc_repro && chmod +x /tmp/symc_repro
cmp /tmp/symc_repro ~/.symbolic/bin/symc && echo "fixpoint OK"

cargo test -p symc0 --test bootstrap   # the same check as a Rust test

Then open sigil/runes/symc/src/main.sym and read it — it is, by construction, a program you already know how to read.