combined parse + canonicalization IR · ideas

I was just talking with @Ayaz Hafiz about an interesting idea: if we have a separate lexer (as @Joshua Warner proposed, which seems reasonable), what if we tried to combine the parsed + canonicalized IRs into one data structure?

Richard Feldman (Feb 27 2023 at 22:15):

Richard Feldman (Feb 27 2023 at 22:16):

and it would also allow things like sorting defs in-place right after parsing + canonicalizing them

Richard Feldman (Feb 27 2023 at 22:18):

Richard Feldman (Feb 27 2023 at 22:19):

this would mean we could stay at 3 IRs even after introducing lexing, since we'd go from parse -> can -> mono to lex -> parsecan -> mono

Richard Feldman (Feb 27 2023 at 22:19):

Ayaz Hafiz (Feb 27 2023 at 22:21):

we also have constraint gen so it would be more like lex -> parse -> constraint gen/ish + solve -> mono

Ayaz Hafiz (Feb 27 2023 at 22:21):

my thought is we can toss some stuff that is done in can right now that is needed for typechecking but not for analyses over the parse AST into the constraint gen pass too

Ayaz Hafiz (Feb 27 2023 at 22:22):

def-sorting is one of those but finding captures, naming anonymous lambdas, resolving alias orderings, etc. are others

Ayaz Hafiz (Feb 27 2023 at 22:22):

all of those analyses can live in look-aside buffers instead of the AST, effectively just more SoA
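
A minimal sketch of the look-aside idea, in Rust. All names here (`NodeId`, `Ast`, `Analyses`, `lambda_names`, `captures`) are hypothetical and are not the compiler's actual types. The only point is that later analyses write into side tables keyed by node index, while the parse AST itself is never mutated.

```rust
/// Index of a node in the AST arrays (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(pub u32);

/// A deliberately tiny set of node kinds for the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NodeTag {
    Def,
    Lambda,
    Var,
}

/// The parse AST: read-only once parsing is done.
pub struct Ast {
    pub tags: Vec<NodeTag>,
}

/// Results of later analyses, stored beside the AST rather than in it.
/// Nodes that an analysis has nothing to say about take up no space.
#[derive(Default)]
pub struct Analyses {
    /// Generated names for anonymous lambdas.
    pub lambda_names: Vec<(NodeId, String)>,
    /// Captured variables per lambda (left empty in this sketch).
    pub captures: Vec<(NodeId, Vec<NodeId>)>,
}

impl Analyses {
    /// Walk the AST once and record a fresh name for every lambda node.
    pub fn name_lambdas(&mut self, ast: &Ast) {
        for (i, tag) in ast.tags.iter().enumerate() {
            if *tag == NodeTag::Lambda {
                let name = format!("lambda_{}", self.lambda_names.len());
                self.lambda_names.push((NodeId(i as u32), name));
            }
        }
    }

    /// Look up the generated name for a node, if it has one.
    pub fn lambda_name(&self, id: NodeId) -> Option<&str> {
        self.lambda_names
            .iter()
            .find(|(node, _)| *node == id)
            .map(|(_, name)| name.as_str())
    }
}
```

Because `name_lambdas` takes `&Ast`, tools that only need the parse tree (a formatter, say) can share the same AST without paying for these tables.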

Ayaz Hafiz (Feb 27 2023 at 22:23):

also getting rid of AST types + can types + subs types in favor of subs types right out of the gate would be huge

Folkert de Vries (Feb 27 2023 at 23:24):

when we're saying lexer, what do we mean exactly? does this mean materializing a sequence (say, a Vec) of tokens?

Joshua Warner (Feb 27 2023 at 23:32):

struct Tokens {
  token_kinds: Vec<TokenKind>, // a repr(u8) enum
  token_offsets: Vec<u32>, // byte offsets into the input, marking the start of each token
  comments_before_each_token: Vec<Option<Vec<CommentOrNewline>>>, // optimized out via some generic hacks, so the lexer doesn't actually produce this unless requested/required.
}

Folkert de Vries (Feb 27 2023 at 23:41):

right. I think we should investigate what Zig does here. From what I understand, it does not materialize this full sequence of tokens

Folkert de Vries (Feb 27 2023 at 23:42):

rather, you have some state, and can ask for the next token. Then in many cases you can store a source position, and e.g. generating error messages can re-tokenize from that location to get the various Regions that are relevant for the message

Joshua Warner (Feb 27 2023 at 23:42):

Oh interesting; I was under the impression that it did materialize all the tokens like that. I could definitely be wrong!

Joshua Warner (Feb 27 2023 at 23:45):

If the tokens were any more complicated than a u8 + u32, the 'incremental' approach is definitely preferable.

Folkert de Vries (Feb 27 2023 at 23:48):

Joshua Warner (Feb 27 2023 at 23:50):

Folkert de Vries (Feb 27 2023 at 23:51):

I believe this refers to the fact that the whole input is in a slice, so it is all in memory at the same time

Folkert de Vries (Feb 27 2023 at 23:51):

Joshua Warner (Feb 27 2023 at 23:52):

const Parser = struct {
    gpa: Allocator,
    source: []const u8,

    token_tags: []const Token.Tag,
    token_starts: []const Ast.ByteOffset,
    tok_i: TokenIndex,



    errors: std.ArrayListUnmanaged(AstError),
    nodes: Ast.NodeList,
    extra_data: std.ArrayListUnmanaged(Node.Index),
    scratch: std.ArrayListUnmanaged(Node.Index),
};

Joshua Warner (Feb 27 2023 at 23:52):

Folkert de Vries (Feb 27 2023 at 23:53):

Folkert de Vries (Feb 27 2023 at 23:58):

Folkert de Vries (Feb 27 2023 at 23:59):

e.g. to determine a region, you can take a token, then just tokenize/parse from that position to figure out the region

Joshua Warner (Feb 28 2023 at 00:02):

I think "tokenizer does not allocate" means the .next() method itself doesn't allocate

Joshua Warner (Feb 28 2023 at 00:02):

The caller, who is presumably appending to these two lists, obviously does allocate.
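
The two approaches discussed above can be sketched side by side in Rust. This is a toy tokenizer with hypothetical names (`Tokenizer`, `tokenize_all`, `region_at`) that only knows identifiers, numbers, and single-byte punctuation. `next_token` never allocates, `tokenize_all` is the caller that materializes the two arrays, and `region_at` recovers a token's byte range by re-tokenizing from a stored offset.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TokenKind {
    Ident,
    Number,
    Punct,
    Eof,
}

/// Tokenizer state: a borrowed input slice plus a cursor.
pub struct Tokenizer<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Tokenizer<'a> {
    /// Start tokenizing at an arbitrary byte offset.
    pub fn at(src: &'a str, offset: u32) -> Self {
        Tokenizer { src: src.as_bytes(), pos: offset as usize }
    }

    /// Return the next token's kind and start offset. Does not allocate.
    pub fn next_token(&mut self) -> (TokenKind, u32) {
        self.skip_while(|b| b.is_ascii_whitespace());
        let start = self.pos;
        if start >= self.src.len() {
            return (TokenKind::Eof, start as u32);
        }
        let first = self.src[start];
        let kind = if first.is_ascii_alphabetic() || first == b'_' {
            self.skip_while(|b| b.is_ascii_alphanumeric() || b == b'_');
            TokenKind::Ident
        } else if first.is_ascii_digit() {
            self.skip_while(|b| b.is_ascii_digit());
            TokenKind::Number
        } else {
            self.pos += 1;
            TokenKind::Punct
        };
        (kind, start as u32)
    }

    fn skip_while(&mut self, pred: impl Fn(u8) -> bool) {
        while self.pos < self.src.len() && pred(self.src[self.pos]) {
            self.pos += 1;
        }
    }
}

/// The materialized form: two parallel arrays, 5 bytes per token.
pub struct Tokens {
    pub kinds: Vec<TokenKind>,
    pub offsets: Vec<u32>,
}

/// The caller is the one that allocates, by appending to the two lists.
pub fn tokenize_all(src: &str) -> Tokens {
    let mut t = Tokenizer::at(src, 0);
    let mut out = Tokens { kinds: Vec::new(), offsets: Vec::new() };
    loop {
        let (kind, start) = t.next_token();
        out.kinds.push(kind);
        out.offsets.push(start);
        if kind == TokenKind::Eof {
            return out;
        }
    }
}

/// Recover the (start, end) byte range of the token at `offset`
/// by re-tokenizing from there instead of storing end positions.
pub fn region_at(src: &str, offset: u32) -> (u32, u32) {
    let mut t = Tokenizer::at(src, offset);
    let (_, start) = t.next_token();
    (start, t.pos as u32)
}
```

Storing only start offsets is what keeps a token at a `u8` plus a `u32`; the end of a token is recomputed on the rare paths (such as error reporting) that need it.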

Joshua Warner (Feb 28 2023 at 00:04):

Folkert de Vries (Feb 28 2023 at 00:04):

yeah, that does not seem like an interesting property to me any more (too deep into the rabbit hole). But I guess it's good to note that e.g. string literals are not copied to a new allocation

Richard Feldman (Feb 28 2023 at 00:17):

@Andrew Kelley can probably shed some light on what Zig actually does here! :smiley:

Andrew Kelley (Feb 28 2023 at 00:28):

Andrew Kelley (Feb 28 2023 at 00:30):

the "lhs"/"rhs" thing I did with the Ast is kinda bad. I recommend using an untagged union instead of that

Andrew Kelley (Feb 28 2023 at 00:31):

Andrew Kelley (Feb 28 2023 at 00:33):

as for the data structures of these things, a couple of years later I am happy with them. you don't need more than 5 bytes per token, and I used 13 bytes per AST node but I think you can get away with 9, because even for a binary operation, you only need to store the lhs and rhs. you can find the operator token trivially with tokenof(rhs) - 1
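
The byte counts above can be checked with a small Rust sketch. The type names are hypothetical stand-ins mirroring the fields discussed in this thread, and the counts assume each field lives in its own array (struct-of-arrays), so no padding is paid per element.

```rust
use std::mem::size_of;

#[repr(u8)]
#[derive(Clone, Copy)]
pub enum TokenTag { Ident, Number, Plus, Minus }

#[repr(u8)]
#[derive(Clone, Copy)]
pub enum NodeTag { Add, Sub, Deref, Call }

#[derive(Clone, Copy)]
pub struct NodeData {
    pub lhs: u32,
    pub rhs: u32,
}

/// Tokens as parallel arrays: 1 byte of tag + 4 bytes of start offset.
pub struct TokenList {
    pub tags: Vec<TokenTag>,
    pub starts: Vec<u32>,
}

/// Nodes as parallel arrays: tag + main token + lhs/rhs payload.
pub struct NodeList {
    pub tags: Vec<NodeTag>,
    pub main_tokens: Vec<u32>,
    pub datas: Vec<NodeData>,
}

pub const BYTES_PER_TOKEN: usize = size_of::<TokenTag>() + size_of::<u32>();
pub const BYTES_PER_NODE: usize =
    size_of::<NodeTag>() + size_of::<u32>() + size_of::<NodeData>();
/// Dropping the `main_tokens` array leaves only tag + payload.
pub const BYTES_PER_NODE_NO_MAIN_TOKEN: usize =
    size_of::<NodeTag>() + size_of::<NodeData>();

/// For contrast: the same fields as one struct per element get padded
/// up to a multiple of the 4-byte alignment.
pub struct PaddedToken {
    pub tag: TokenTag,
    pub start: u32,
}

pub struct PaddedNode {
    pub tag: NodeTag,
    pub main_token: u32,
    pub data: NodeData,
}
```

Here `BYTES_PER_TOKEN` is 5, `BYTES_PER_NODE` is 13, and `BYTES_PER_NODE_NO_MAIN_TOKEN` is 9, whereas `PaddedToken` and `PaddedNode` occupy 8 and 16 bytes respectively.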

Andrew Kelley (Feb 28 2023 at 00:34):

one more suggestion: any computation you can do on a per-file basis without access to compiler flags or other files, you can do EXTREMELY quickly, and cache it trivially

Andrew Kelley (Feb 28 2023 at 00:36):

when your data in memory is built of a handful of arrays, you can yeet it to and from disk with a single writev/readv call (ok 1 more call to learn the lengths of the arrays in the case of reading)
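
A rough Rust sketch of that idea, under several assumptions: the `TokenArrays` type and the `save`/`load` functions are hypothetical, the on-disk layout is simply a two-word length header followed by the raw arrays in native byte order, and a short vectored write is treated as an error where a production version would retry. `write_vectored` maps to `writev` on platforms that support it; the read side here uses plain `read_exact` calls for simplicity.

```rust
use std::io::{self, IoSlice, Read, Write};

pub struct TokenArrays {
    pub kinds: Vec<u8>,
    pub offsets: Vec<u32>,
}

/// View a u32 slice as raw bytes (native endianness).
fn u32s_as_bytes(xs: &[u32]) -> &[u8] {
    // SAFETY: u8 has alignment 1 and every byte of a u32 is initialized.
    unsafe { std::slice::from_raw_parts(xs.as_ptr() as *const u8, xs.len() * 4) }
}

/// Mutable byte view of a u32 slice, used as a read target.
fn u32s_as_bytes_mut(xs: &mut [u32]) -> &mut [u8] {
    // SAFETY: as above, and every bit pattern is a valid u32.
    unsafe { std::slice::from_raw_parts_mut(xs.as_mut_ptr() as *mut u8, xs.len() * 4) }
}

/// Write the length header and both arrays with one vectored write.
pub fn save(t: &TokenArrays, out: &mut impl Write) -> io::Result<()> {
    let header = [t.kinds.len() as u32, t.offsets.len() as u32];
    let bufs = [
        IoSlice::new(u32s_as_bytes(&header)),
        IoSlice::new(&t.kinds),
        IoSlice::new(u32s_as_bytes(&t.offsets)),
    ];
    let total: usize = bufs.iter().map(|b| b.len()).sum();
    let written = out.write_vectored(&bufs)?;
    if written != total {
        // Sketch only: real code would loop over the remaining slices.
        return Err(io::Error::new(io::ErrorKind::WriteZero, "short vectored write"));
    }
    Ok(())
}

/// Read the lengths first, then fill pre-sized arrays directly.
pub fn load(input: &mut impl Read) -> io::Result<TokenArrays> {
    let mut header = [0u32; 2];
    input.read_exact(u32s_as_bytes_mut(&mut header))?;
    let mut kinds = vec![0u8; header[0] as usize];
    let mut offsets = vec![0u32; header[1] as usize];
    input.read_exact(&mut kinds)?;
    input.read_exact(u32s_as_bytes_mut(&mut offsets))?;
    Ok(TokenArrays { kinds, offsets })
}
```

No per-element encoding or decoding happens in either direction: the bytes on disk are the bytes in memory, which is why this only works for a cache local to one machine and one compiler version.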

Andrew Kelley (Feb 28 2023 at 00:38):

IMO it's not worth caching the results of parsing because it turns out source code is actually quite a compact way of representing an AST. however if you do any computation on top of the AST then caching could be beneficial

Joshua Warner (Feb 28 2023 at 01:08):

pub const Node = struct {
    tag: Tag,
    main_token: TokenIndex,
    data: Data,

    pub const Data = struct {
        lhs: Index,
        rhs: Index,
    };

    pub const Index = u32;
};

I'm assuming you mean removing the backing array for main_token? If so, how would you map back to the source location for things like error messages?

Andrew Kelley (Feb 28 2023 at 01:42):

having a main_token would be a fine, conservative choice. The alternative I am hinting at would look something like this:

node_tags: []Node.Tag,
node_datas: []Node.Data,
string_bytes: []const u8,
extra: []const u32,

pub const Node = struct {
    tag: Tag,
    data: Data,

    pub const Tag = enum(u8) {
        /// Pointer deref syntax (`*x`)
        /// Uses the `op_tok` union field.
        /// Token is the asterisk.
        deref,
        /// Function call syntax (`a(b, c, etc)`)
        /// Uses the `tok_payload` union field.
        /// Token is the open parenthesis.
        /// Payload points to a Call.
        call,
        /// Addition syntax (`a + b`)
        /// Uses the `bin` union field.
        add,
        /// Subtraction syntax (`a - b`)
        /// Uses the `bin` union field.
        sub,
    };

    pub const Data = union {
        op_tok: struct {
            operand: Index,
            token: TokenIndex,
        },
        tok_payload: struct {
            token: TokenIndex,
            /// Index into the extra array
            payload: u32,
        },
        bin: struct {
            lhs: Index,
            rhs: Index,
        },

        // Make sure we don't accidentally add a field to make this union
        // bigger than expected. Note that in Debug builds, Zig is allowed
        // to insert a secret field for safety checks.
        comptime {
            if (builtin.mode != .Debug and builtin.mode != .ReleaseSafe) {
                assert(@sizeOf(Data) == 8);
            }
        }

    };

    /// Use a non-exhaustive enum instead of an integer for type safety.
    /// You can give it methods for convenience, or even give it
    /// special tags (example with `none` below).
    pub const Index = enum (u32) {
        none = std.math.maxInt(u32),
        _,

        pub fn toInt(i: Index) u32 {
            assert(i != .none);
            return @enumToInt(i);
        }
    };

    pub const TokenIndex = enum (u32) { _ };

    /// Trailing is:
    /// * arg: Index for each args_len
    pub const Call = struct {
        callee: Index,
        args_len: u32,
    };
};

Andrew Kelley (Feb 28 2023 at 01:43):

Andrew Kelley (Feb 28 2023 at 01:44):

Andrew Kelley (Feb 28 2023 at 01:46):

with this AST encoding, it would require a helper function such as node.firstToken() to find the token based on a node. to get the index of the operator for a binary operation, you would do bin.rhs.firstToken(ast) - 1

Andrew Kelley (Feb 28 2023 at 01:47):

or maybe ast.nodeFirstToken(bin.rhs) - 1 depending on which namespace you want to put the method in

Andrew Kelley (Feb 28 2023 at 01:48):

Andrew Kelley (Feb 28 2023 at 01:52):

The way I like to think about this stuff is that you are coming up with an encoding that is a form of bespoke compression for part of your compiler's state. You are, effectively, making your pipeline operate on compressed input, perform computations directly on a compressed encoding, and output different, also compressed data. The performance wins come from the fact that because the data is compressed, and yet does not ever have to be converted between uncompressed/compressed forms, the computer ends up doing less work.

Joshua Warner (Feb 28 2023 at 02:08):

Ahhh I see - so somewhere in the leaves of the tree, you always have sufficient information to accurately return firstToken directly. And all of the higher-level nodes can compute firstToken by delegating to their left child (and possibly also walking back a small number of tokens before that).
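
That delegation can be sketched in Rust as follows. The `Node` enum and the method names are hypothetical, and a Rust `enum` is a tagged union, so this does not reproduce the 8-byte untagged layout from the Zig example; it only illustrates the lookup logic. It also assumes the operator token sits immediately before the first token of the right operand, with nothing such as an unrecorded parenthesis in between.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeIdx(pub u32);

pub type TokenIdx = u32;

#[derive(Clone, Copy, Debug)]
pub enum Node {
    /// Leaf such as an identifier or literal: stores its own token.
    Atom { token: TokenIdx },
    /// Prefix operator such as `*x`: stores the operator token.
    Prefix { op_token: TokenIdx, operand: NodeIdx },
    /// Binary operator such as `a + b`: stores no token at all.
    Binary { lhs: NodeIdx, rhs: NodeIdx },
}

pub struct Ast {
    pub nodes: Vec<Node>,
}

impl Ast {
    /// Find the first token of a node by walking down its left spine
    /// until a node that records a token is reached.
    pub fn first_token(&self, idx: NodeIdx) -> TokenIdx {
        let mut cur = idx;
        loop {
            match self.nodes[cur.0 as usize] {
                Node::Atom { token } => return token,
                Node::Prefix { op_token, .. } => return op_token,
                Node::Binary { lhs, .. } => cur = lhs,
            }
        }
    }

    /// For a binary node, the operator is the token just before the
    /// right operand's first token. Returns None for other node kinds.
    pub fn operator_token(&self, idx: NodeIdx) -> Option<TokenIdx> {
        match self.nodes[idx.0 as usize] {
            Node::Binary { rhs, .. } => Some(self.first_token(rhs) - 1),
            _ => None,
        }
    }
}
```

For the token stream `a + * b - c` (token indices 0 through 5) parsed as `(a + (*b)) - c`, the outer subtraction resolves its operator to token 4 and the inner addition to token 1, without either binary node storing a token index.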

Anton (Mar 03 2023 at 12:28):

One issue came to mind with combining parsed and canonicalized IRs; does this not conflict with our earlier plans to have an AST for the parser and one for the editor with auto-generated conversion functions?

Anton (Mar 03 2023 at 12:40):

Based on earlier messages I also want to warn against optimizing too early. In general, I believe a piece of code may be optimized after we have high confidence in its correctness because optimizing before that point will make it harder to debug. I also expect that this strategy will get us to code that is fast and reliable more quickly because less time is spent debugging.

Joshua Warner (Mar 03 2023 at 15:54):

I don't have a great understanding of the invariants that canonicalization IR guarantees - so take this with a grain of salt - but my general impression is that it has constraints that are _different enough_ from the parser AST that it's fairly non-trivial to translate.

The direction I'd like to approach the problem is to start with trying to unify (aka automate the translation between) the editor AST and the parser AST - and if it happens to work out that the result of that is usable as the canonicalization IR, great!

My reasoning here is that it seems more important to have the editor/parser ASTs be "symmetric", to reduce the number of bugs / missing features that are just on one side of that divide - than it is to have the parser/canonicalization AST be the same.
