I was just talking with @Ayaz Hafiz about an interesting idea: if we have a separate lexer (as @Joshua Warner proposed, which seems reasonable) what if we tried to combine the parsed + canonicalized IRs into one data structure?
having names resolved is actually necessary to implement https://github.com/roc-lang/roc/issues/4431 so this would be a benefit to the formatter
and it would also allow things like sorting defs in-place right after having parsed + canonicalizing them
canonicalization needs to reorder defs into dependency order for purposes of type-checking, so https://github.com/roc-lang/roc/issues/4430 (and the formatter automation of that, https://github.com/roc-lang/roc/issues/4431) actually mean that we want to switch from automatically reordering the defs (like we do today) during canonicalization to instead verifying that they're in dependency order, and giving an error if not (this is what https://github.com/roc-lang/roc/issues/4430 does)
this would mean we could go from 3 IRs to 3 IRs even after introducing lexing, since we'd go from parse -> can -> mono to lex -> parsecan -> mono
and presumably the combined version would run faster!
we also have constraint gen so it would be more like lex -> parse -> constraint gen/ish + solve -> mono
my thought is we can toss some stuff that is done in can right now that is needed for typechecking but not for analyses over the parse AST into the constraint gen pass too
def-sorting is one of those but finding captures, naming anonymous lambdas, resolving alias orderings, etc. are others
all of those analyses can live in look aside buffers instead of the AST, effectively just more SoA
also getting rid of AST types + can types + subs types in favor for subs types out of the door would be huge
when we're saying lexer, what do we mean exactly? does this mean materializing a sequence (say, a Vec) of tokens?
In my (mostly in my head) design, the output of the lexer is something like:
struct Tokens {
token_kinds: Vec<TokenKind>, // a repr(u8) enum
token_offsets: Vec<u32>, // byte offsets into the input, marking the start of each token
comments_before_each_token: Vec<Option<Vec<CommentOrNewline>>>, // optimized out via some genric hacks, so the lexer doesn't actually produce this unless requested/required.
}
right. I think we should investigate that zig does here. From what I understand, it does not materialize this full sequence of tokens
rather, you have some state, and can ask for the next token. Then in many cases you can store a source position, and e.g. generating error messages can re-tokenize from that location to get the various Regions that are relevant for the message
Oh interesting; I was under the impression that it did materialize all the tokens like that. I could definitely be wrong!
rather, you have some state, and can ask for the next token.
This is 100% possible to do - but it's not necessarily a panacea.
If the tokens were any more complicated than a u8 + u32, the 'incremental' approach is definitely preferable.
https://mitchellh.com/zig/tokenizer is what I base this on. Which might be outdated now but this is what I remember
In practice, programming language inputs are not “sufficiently large” so the entire source file is tokenized at once; this is how Zig works today, too.
I believe this refers to that the whole input is in a slice, so it is all in memory at the same time
rather than using some implementor of the Read trait
See the next doc - https://mitchellh.com/zig/parser
const Parser = struct {
gpa: Allocator,
source: []const u8,
token_tags: []const Token.Tag,
token_starts: []const Ast.ByteOffset,
tok_i: TokenIndex,
errors: std.ArrayListUnmanaged(AstError),
nodes: Ast.NodeList,
extra_data: std.ArrayListUnmanaged(Node.Index),
scratch: std.ArrayListUnmanaged(Node.Index),
};
I see a list of tokens there, and no tokenizer state
hmm yeah. Then what does the "tokenizer does not allocate" claim mean?
current main also has the list of tokens https://github.com/ziglang/zig/blob/master/lib/std/zig/Ast.zig#L21
so I guess I got confused and it does store tokens, but not more than that
e.g. to determine a region, you can take a token, then just tokenize/parse from that position to figure out the region
I think "tokenizer does not allocate" means the .next() method itself doesn't allocate
The caller, who is presumably appending to these two lists, obviously does allocate.
Yeah, it doesn't store the end position of each token.
yeah, that does not seem like an interesting property to me any more (too deep into the rabbit hole). But I guess it's good to note that e.g. string literals are not copied to a new allocation
@Andrew Kelley can probably shed some light on what Zig actually does here! :smiley:
hello
tokenizer is here: https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/tokenizer.zig
tokenization and parsing is done here: https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/parse.zig#L15-L78
the "lhs"/"rhs" thing I did with the Ast is kinda bad. I recommend to use an untagged union instead of that
example of that here: https://github.com/ziglang/zig/blob/0.10.1/src/Zir.zig
as for the data structures of these things, after a couple years later I am happy with them. you don't need more than 5 bytes per token, and I used 13 bytes per AST node but I think you can get away with 9, because even for a binary operation, you only need to store the lhs and rhs. you can find the operator token trivially with tokenof(rhs) - 1
one more suggestion: any computation you can do on a per-file basis without access to compiler flags or other files, you can do EXTREMELY quickly, and cache it trivially
when your data in memory is built of a handful of arrays, you can yeet it to and from disk with a single writev/readv call (ok 1 more call to learn the lengths of the arrays in the case of reading)
IMO it's not worth caching the results of parsing because it turns out source code is actually quite a compact way of representing an AST. however if you do any computation on top of the AST then caching could be beneficial
I used 13 bytes per AST node but I think you can get away with 9, because even for a binary operation, you only need to store the lhs and rhs
@Andrew Kelley Hmm, which 9 bytes would those be?
In the blog post linked above I see:
pub const Node = struct {
tag: Tag,
main_token: TokenIndex,
data: Data,
pub const Data = struct {
lhs: Index,
rhs: Index,
};
pub const Index = u32;
};
I'm assuming you mean removing the backing array for main_token? If so, how would you map back to the source location for things like error messages?
having a main_token would be a fine, conservative choice. The alternative I am hinting at would look something like this:
node_tags: []Node.Tag,
node_datas: []Node.Data,
string_bytes: []const u8,
extra: []const u32,
pub const Node = struct {
tag: Tag,
data: Data,
pub const Tag = enum(u8) {
/// Pointer deref syntax (`*x`)
/// Uses the `op_tok` union field.
/// Token is the asterisk.
deref,
/// Function call syntax (`a(b, c, etc)`)
/// Uses the `tok_payload` union field.
/// Token is the open parenthesis.
/// Payload points to a Call.
call,
/// Addition syntax (`a + b`)
/// Uses the `bin` union field.
add,
/// Subtraction syntax (`a - b`)
/// Uses the `bin` union field.
sub,
};
pub const Data = union {
op_tok: struct {
operand: Index,
token: TokenIndex,
},
tok_payload: struct {
token: TokenIndex,
/// Index into the extra array
payload: u32,
},
bin: struct {
lhs: Index,
rhs: Index,
},
// Make sure we don't accidentally add a field to make this union
// bigger than expected. Note that in Debug builds, Zig is allowed
// to insert a secret field for safety checks.
comptime {
if (builtin.mode != .Debug and builtin.mode != .ReleaseSafe) {
assert(@sizeOf(Data) == 8);
}
}
};
/// Use a non-exhaustive enum instead of an integer for type safety.
/// You can give it methods for convenience, or even give it
/// special tags (example with `none` below).
pub const Index = enum (u32) {
none = std.math.maxInt(u32),
_,
pub fn toInt(i: Index) u32 {
assert(i != .none);
return @enumToInt(i);
}
};
pub const TokenIndex = enum (u32) { _ };
/// Trailing is:
/// * arg: Index for each args_len
pub const Call = struct {
callee: Index,
args_len: u32,
};
};
for error messages all you need is a Node.Index or a TokenIndex
computing line/column based on this information is
with this AST encoding, it would require a helper function such as node.firstToken() to find the token based on a node. to get the index of the operator for a binary operation, you would do bin.rhs.firstToken(ast) - 1
or maybe ast.nodeFirstToken(bin.rhs) - 1 depending on which namespace you want to put the method in
https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/Ast.zig#L381 zig's firstToken() function
The way I like to think about this stuff is that you are coming up with an encoding that is a form of bespoke compression for part of your compiler's state. You are, effectively, making your pipeline operate on compressed input, perform computations directly on a compressed encoding, and output different, also compressed data. The performance wins come from the fact that because the data is compressed, and yet does not ever have to be converted between uncompressed/compressed forms, the computer ends up doing less work.
Ahhh I see - so somewhere in the leaves of the tree, you always have sufficient information to accurately return firstToken directly. And all of the higher-level nodes can compute firstToken by delegating to their left child (and possibly also walking back a small number of tokens before that).
Clever!
One issue came to mind with combining parsed and canonicalized IRs; does this not conflict with our earlier plans to have an AST for the parser and one for the editor with auto-generated conversion functions?
Based on earlier messages I also want to warn against optimizing too early. In general, I believe a piece of code may be optimized after we have high confidence in its correctness because optimizing before that point will make it harder to debug. I also expect that this strategy will get us to code that is fast and reliable more quickly because less time is spent debugging.
I don't have a great understanding of the invariants that canonicalization IR guarantees - so take this with a grain of salt - but my general impression is that it has constraints that are _different enough_ from the parser AST that it's fairly non-trivial to translate.
The direction I'd like to approach the problem is to start with trying to unify (aka automate the translation between) the editor AST and the parser AST - and if it happens to work out that the result of that is usable as the canonicalization IR, great!
My reasoning here is that it seems more important to have the editor/parser ASTs be "symmetric", to reduce the number of bugs / missing features that are just on one side of that divide - than it is to have the parser/canonicalization AST be the same.
Last updated: Jun 16 2026 at 16:19 UTC