Stream: ideas

Topic: combined parse + canonicalization IR


view this post on Zulip Richard Feldman (Feb 27 2023 at 22:15):

I was just talking with @Ayaz Hafiz about an interesting idea: if we have a separate lexer (as @Joshua Warner proposed, which seems reasonable) what if we tried to combine the parsed + canonicalized IRs into one data structure?

view this post on Zulip Richard Feldman (Feb 27 2023 at 22:15):

having names resolved is actually necessary to implement https://github.com/roc-lang/roc/issues/4431 so this would be a benefit to the formatter

view this post on Zulip Richard Feldman (Feb 27 2023 at 22:16):

and it would also allow things like sorting defs in-place right after having parsed + canonicalizing them

view this post on Zulip Richard Feldman (Feb 27 2023 at 22:18):

canonicalization needs to reorder defs into dependency order for purposes of type-checking, so https://github.com/roc-lang/roc/issues/4430 (and the formatter automation of that, https://github.com/roc-lang/roc/issues/4431) actually mean that we want to switch from automatically reordering the defs (like we do today) during canonicalization to instead verifying that they're in dependency order, and giving an error if not (this is what https://github.com/roc-lang/roc/issues/4430 does)

view this post on Zulip Richard Feldman (Feb 27 2023 at 22:19):

this would mean we could go from 3 IRs to 3 IRs even after introducing lexing, since we'd go from parse -> can -> mono to lex -> parsecan -> mono

view this post on Zulip Richard Feldman (Feb 27 2023 at 22:19):

and presumably the combined version would run faster!

view this post on Zulip Ayaz Hafiz (Feb 27 2023 at 22:21):

we also have constraint gen so it would be more like lex -> parse -> constraint gen/ish + solve -> mono

view this post on Zulip Ayaz Hafiz (Feb 27 2023 at 22:21):

my thought is we can toss some stuff that is done in can right now that is needed for typechecking but not for analyses over the parse AST into the constraint gen pass too

view this post on Zulip Ayaz Hafiz (Feb 27 2023 at 22:22):

def-sorting is one of those but finding captures, naming anonymous lambdas, resolving alias orderings, etc. are others

view this post on Zulip Ayaz Hafiz (Feb 27 2023 at 22:22):

all of those analyses can live in look aside buffers instead of the AST, effectively just more SoA

view this post on Zulip Ayaz Hafiz (Feb 27 2023 at 22:23):

also getting rid of AST types + can types + subs types in favor for subs types out of the door would be huge

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:24):

when we're saying lexer, what do we mean exactly? does this mean materializing a sequence (say, a Vec) of tokens?

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:32):

In my (mostly in my head) design, the output of the lexer is something like:

struct Tokens {
  token_kinds: Vec<TokenKind>, // a repr(u8) enum
  token_offsets: Vec<u32>, // byte offsets into the input, marking the start of each token
  comments_before_each_token: Vec<Option<Vec<CommentOrNewline>>>, // optimized out via some genric hacks, so the lexer doesn't actually produce this unless requested/required.
}

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:41):

right. I think we should investigate that zig does here. From what I understand, it does not materialize this full sequence of tokens

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:42):

rather, you have some state, and can ask for the next token. Then in many cases you can store a source position, and e.g. generating error messages can re-tokenize from that location to get the various Regions that are relevant for the message

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:42):

Oh interesting; I was under the impression that it did materialize all the tokens like that. I could definitely be wrong!

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:45):

rather, you have some state, and can ask for the next token.

This is 100% possible to do - but it's not necessarily a panacea.

If the tokens were any more complicated than a u8 + u32, the 'incremental' approach is definitely preferable.

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:48):

https://mitchellh.com/zig/tokenizer is what I base this on. Which might be outdated now but this is what I remember

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:50):

In practice, programming language inputs are not “sufficiently large” so the entire source file is tokenized at once; this is how Zig works today, too.

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:51):

I believe this refers to that the whole input is in a slice, so it is all in memory at the same time

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:51):

rather than using some implementor of the Read trait

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:52):

See the next doc - https://mitchellh.com/zig/parser

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:52):

const Parser = struct {
    gpa: Allocator,
    source: []const u8,

    token_tags: []const Token.Tag,
    token_starts: []const Ast.ByteOffset,
    tok_i: TokenIndex,



    errors: std.ArrayListUnmanaged(AstError),
    nodes: Ast.NodeList,
    extra_data: std.ArrayListUnmanaged(Node.Index),
    scratch: std.ArrayListUnmanaged(Node.Index),
};

view this post on Zulip Joshua Warner (Feb 27 2023 at 23:52):

I see a list of tokens there, and no tokenizer state

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:53):

hmm yeah. Then what does the "tokenizer does not allocate" claim mean?

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:58):

current main also has the list of tokens https://github.com/ziglang/zig/blob/master/lib/std/zig/Ast.zig#L21

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:59):

so I guess I got confused and it does store tokens, but not more than that

view this post on Zulip Folkert de Vries (Feb 27 2023 at 23:59):

e.g. to determine a region, you can take a token, then just tokenize/parse from that position to figure out the region

view this post on Zulip Joshua Warner (Feb 28 2023 at 00:02):

I think "tokenizer does not allocate" means the .next() method itself doesn't allocate

view this post on Zulip Joshua Warner (Feb 28 2023 at 00:02):

The caller, who is presumably appending to these two lists, obviously does allocate.

view this post on Zulip Joshua Warner (Feb 28 2023 at 00:04):

Yeah, it doesn't store the end position of each token.

view this post on Zulip Folkert de Vries (Feb 28 2023 at 00:04):

yeah, that does not seem like an interesting property to me any more (too deep into the rabbit hole). But I guess it's good to note that e.g. string literals are not copied to a new allocation

view this post on Zulip Richard Feldman (Feb 28 2023 at 00:17):

@Andrew Kelley can probably shed some light on what Zig actually does here! :smiley:

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:28):

hello

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:28):

tokenizer is here: https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/tokenizer.zig

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:30):

tokenization and parsing is done here: https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/parse.zig#L15-L78

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:30):

the "lhs"/"rhs" thing I did with the Ast is kinda bad. I recommend to use an untagged union instead of that

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:31):

example of that here: https://github.com/ziglang/zig/blob/0.10.1/src/Zir.zig

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:33):

as for the data structures of these things, after a couple years later I am happy with them. you don't need more than 5 bytes per token, and I used 13 bytes per AST node but I think you can get away with 9, because even for a binary operation, you only need to store the lhs and rhs. you can find the operator token trivially with tokenof(rhs) - 1

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:34):

one more suggestion: any computation you can do on a per-file basis without access to compiler flags or other files, you can do EXTREMELY quickly, and cache it trivially

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:36):

when your data in memory is built of a handful of arrays, you can yeet it to and from disk with a single writev/readv call (ok 1 more call to learn the lengths of the arrays in the case of reading)

view this post on Zulip Andrew Kelley (Feb 28 2023 at 00:38):

IMO it's not worth caching the results of parsing because it turns out source code is actually quite a compact way of representing an AST. however if you do any computation on top of the AST then caching could be beneficial

view this post on Zulip Joshua Warner (Feb 28 2023 at 01:08):

I used 13 bytes per AST node but I think you can get away with 9, because even for a binary operation, you only need to store the lhs and rhs

@Andrew Kelley Hmm, which 9 bytes would those be?

In the blog post linked above I see:

pub const Node = struct {
    tag: Tag,
    main_token: TokenIndex,
    data: Data,

    pub const Data = struct {
        lhs: Index,
        rhs: Index,
    };

    pub const Index = u32;
};

I'm assuming you mean removing the backing array for main_token? If so, how would you map back to the source location for things like error messages?

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:42):

having a main_token would be a fine, conservative choice. The alternative I am hinting at would look something like this:

node_tags: []Node.Tag,
node_datas: []Node.Data,
string_bytes: []const u8,
extra: []const u32,

pub const Node = struct {
    tag: Tag,
    data: Data,

    pub const Tag = enum(u8) {
        /// Pointer deref syntax (`*x`)
        /// Uses the `op_tok` union field.
        /// Token is the asterisk.
        deref,
        /// Function call syntax (`a(b, c, etc)`)
        /// Uses the `tok_payload` union field.
        /// Token is the open parenthesis.
        /// Payload points to a Call.
        call,
        /// Addition syntax (`a + b`)
        /// Uses the `bin` union field.
        add,
        /// Subtraction syntax (`a - b`)
        /// Uses the `bin` union field.
        sub,
    };

    pub const Data = union {
        op_tok: struct {
            operand: Index,
            token: TokenIndex,
        },
        tok_payload: struct {
            token: TokenIndex,
            /// Index into the extra array
            payload: u32,
        },
        bin: struct {
            lhs: Index,
            rhs: Index,
        },

        // Make sure we don't accidentally add a field to make this union
        // bigger than expected. Note that in Debug builds, Zig is allowed
        // to insert a secret field for safety checks.
        comptime {
            if (builtin.mode != .Debug and builtin.mode != .ReleaseSafe) {
                assert(@sizeOf(Data) == 8);
            }
        }

    };

    /// Use a non-exhaustive enum instead of an integer for type safety.
    /// You can give it methods for convenience, or even give it
    /// special tags (example with `none` below).
    pub const Index = enum (u32) {
        none = std.math.maxInt(u32),
        _,

        pub fn toInt(i: Index) u32 {
            assert(i != .none);
            return @enumToInt(i);
        }
    };

    pub const TokenIndex = enum (u32) { _ };

    /// Trailing is:
    /// * arg: Index for each args_len
    pub const Call = struct {
        callee: Index,
        args_len: u32,
    };
};

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:43):

for error messages all you need is a Node.Index or a TokenIndex

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:44):

computing line/column based on this information is

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:46):

with this AST encoding, it would require a helper function such as node.firstToken() to find the token based on a node. to get the index of the operator for a binary operation, you would do bin.rhs.firstToken(ast) - 1

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:47):

or maybe ast.nodeFirstToken(bin.rhs) - 1 depending on which namespace you want to put the method in

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:48):

https://github.com/ziglang/zig/blob/0.10.1/lib/std/zig/Ast.zig#L381 zig's firstToken() function

view this post on Zulip Andrew Kelley (Feb 28 2023 at 01:52):

The way I like to think about this stuff is that you are coming up with an encoding that is a form of bespoke compression for part of your compiler's state. You are, effectively, making your pipeline operate on compressed input, perform computations directly on a compressed encoding, and output different, also compressed data. The performance wins come from the fact that because the data is compressed, and yet does not ever have to be converted between uncompressed/compressed forms, the computer ends up doing less work.

view this post on Zulip Joshua Warner (Feb 28 2023 at 02:08):

Ahhh I see - so somewhere in the leaves of the tree, you always have sufficient information to accurately return firstToken directly. And all of the higher-level nodes can compute firstToken by delegating to their left child (and possibly also walking back a small number of tokens before that).

Clever!

view this post on Zulip Anton (Mar 03 2023 at 12:28):

One issue came to mind with combining parsed and canonicalized IRs; does this not conflict with our earlier plans to have an AST for the parser and one for the editor with auto-generated conversion functions?

view this post on Zulip Anton (Mar 03 2023 at 12:40):

Based on earlier messages I also want to warn against optimizing too early. In general, I believe a piece of code may be optimized after we have high confidence in its correctness because optimizing before that point will make it harder to debug. I also expect that this strategy will get us to code that is fast and reliable more quickly because less time is spent debugging.

view this post on Zulip Joshua Warner (Mar 03 2023 at 15:54):

I don't have a great understanding of the invariants that canonicalization IR guarantees - so take this with a grain of salt - but my general impression is that it has constraints that are _different enough_ from the parser AST that it's fairly non-trivial to translate.

The direction I'd like to approach the problem is to start with trying to unify (aka automate the translation between) the editor AST and the parser AST - and if it happens to work out that the result of that is usable as the canonicalization IR, great!

My reasoning here is that it seems more important to have the editor/parser ASTs be "symmetric", to reduce the number of bugs / missing features that are just on one side of that divide - than it is to have the parser/canonicalization AST be the same.


Last updated: Jun 16 2026 at 16:19 UTC