Stream: compiler development

Topic: zig compiler - tokenizer


view this post on Zulip Brendan Hansknecht (Mar 08 2025 at 06:56):

I'm just tinkering with tokenizing fuzzing bugs. Decided to pull into a separate thread to not clutter other discussions.

@Joshua Warner how would you expect a numeric base without any follow digits to tokenize?
0x. for example. Should we generate some sort of malformed node or error if the base is not followed by a number?

Currently this is one of the tokenizer fuzzer failures and I was just trying to clean it up.

view this post on Zulip Brendan Hansknecht (Mar 08 2025 at 07:02):

example failure: zig build repro-tokenize -- -b MHguMA== -v

view this post on Zulip Joshua Warner (Mar 08 2025 at 16:09):

Yeah that looks malformed to me

view this post on Zulip Luke Boswell (Mar 10 2025 at 01:13):

Found an issue from a fuzz crash. Made a small test that is failing, and investigating now. It tokenizes the ] as a CloseCurly instead of an CloseSquare.

try testTokenization(
        gpa,
        \\f{o,
        \\     ]
        \\
        \\foo =
        \\
        \\    "onmo %
    ,
        &[_]Token.Tag{
            .LowerIdent,
            .OpenCurly,
            .LowerIdent,
            .Comma,
            .Newline,
            .CloseSquare,
            .Newline,
            .LowerIdent,
            .OpAssign,
            .Newline,
            .StringStart,
            .StringPart,
        },
    );

view this post on Zulip Joshua Warner (Mar 10 2025 at 01:43):

Optimistically, I think that ought to detect the mismatched brackets, infer the intent was to have the correct closing bracket there, emit a message, and continue.

That said, it feels a bit more likely that that’s just an accident at this point.

view this post on Zulip Luke Boswell (Mar 10 2025 at 03:21):

Yeah, you're correct. I hadn't noticed the Diagnostic was being correctly reported for Mismatched braces.

view this post on Zulip Luke Boswell (Mar 10 2025 at 03:22):

I had a break for lunch, but almost finished re-wiring the way I'm dealing with Problems in the snapshots so these issues are surfaced easier.

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:39):

From your other post @Luke Boswell :

~~~PROBLEMS
check.parse.tokenize.Diagnostic.Tag.AsciiControl
check.parse.tokenize.Diagnostic.Tag.MismatchedBrace
check.parse.tokenize.Diagnostic.Tag.UnclosedString

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:39):

Looks like it's there just fine. We just don't seem to short circuit the input when we have tokenize errors

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:39):

I hope we do for parse errors

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:40):

THe invariant should be: For any fuzz input that has no tokenization or parse errors, it should have stable formatting

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:40):

And a stable AST

view this post on Zulip Brendan Hansknecht (Mar 10 2025 at 16:07):

We just call the parser directly passing in the source. It calls the tokenizer. Not sure where the tokenizer errors end up. But currently only the errors returned by parse are checked. This is where one global list of all errors that builds up with each new stage would be nice likely

view this post on Zulip Anthony Bullard (Mar 10 2025 at 21:15):

I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty

view this post on Zulip Brendan Hansknecht (Mar 10 2025 at 21:23):

Yeah, may even switch on type of error. Cause we want to attempt to compile even with many classes of errors.

view this post on Zulip Luke Boswell (Mar 10 2025 at 21:24):

I've done this in my latest PR for review. Probably needs adjustment, but its a start.

view this post on Zulip Sam Mohr (Mar 10 2025 at 21:32):

Anthony Bullard said:

I think I agree that after a stage any stage errors should be added to a tagged union list of some sort and then the next stage can choose to proceed or not based on it being non-empty

Pretty much every kind of error should be continued from, the only places I can think of this not being true are:

view this post on Zulip Anthony Bullard (Mar 10 2025 at 22:20):

There are definitely classes of malformed nodes for example where it would not make sense to continue...

view this post on Zulip Anthony Bullard (Mar 10 2025 at 22:20):

Not to mention illegal tokens

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:00):

Can you give an example of the former? Illegal tokens could make sense, but I think we'd still want to allow running the code with malformed nodes, it'd just crash quickly

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:01):

It feels like these are cases where we'd now be deciding "surely the user doesn't want to run code that is malformed" for them

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:02):

But they should be free to decide for themselves

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:02):

If I have a malformed expression in a list literal, how do I run that?

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:02):

I crash

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:03):

Just like I do for a type error

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:03):

Canonicalization will generate a runtime error for that whole node

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:03):

Sure but that seems a little silly to me

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:04):

I’m using a strongly typed, compiled language to shift errors to the left

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:04):

What if that function is barely called?

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:04):

I think you might be missing the second paradigm of using Roc

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:04):

I’d still want to know closer to when I wrote it

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:04):

The first is the one we both like and mainly care about

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:04):

Then when it just happens to get called - maybe a week later

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:05):

Oh, I understand the disconnect

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:05):

The runtime error would be paired with a compiler error that would give a non-zero exit code on roc build

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:06):

You can still use the binary, but you'd know immediately that you have a positive number of errors and probably should fix those

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:06):

During dev, you want to be able to run anyway

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:06):

And 99% of people will fix the error

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:07):

But doing we really want to block the 1% of people that want to run anyway?

view this post on Zulip Sam Mohr (Mar 10 2025 at 23:07):

If we decide against supporting that, then I truly have misaligned on "always inform, never block"

view this post on Zulip Richard Feldman (Mar 10 2025 at 23:08):

yeah that's idea is "always inform, never block" - report the compiler error so you know there's a problem, but always let you have the choice to run it anyway

view this post on Zulip Richard Feldman (Mar 10 2025 at 23:08):

you shift left while also simultaneously not shifting left if you don't want to. it's like Schrödinger's Shift Left

view this post on Zulip Anthony Bullard (Mar 10 2025 at 23:51):

Got it. I'm just not the target audience for that paradigm. I get enough runtime errors writing JavaScript

view this post on Zulip Richard Feldman (Mar 11 2025 at 02:14):

really simple example of where this is nice at even the lexing level: I'm trying to merge in a branch and see if it fixes what I'm working on. There are merge conflicts in a part of the code base I'm not working on, creating syntax errors with all their ======s etc.

I would still like to be able to run tests on my unrelated part of the code base, so I can tell where it's worth my time to fix those merge conflicts or whether that branch doesn't fix my problem. If the merge conflicts block me from trying it out, I have no choice but to fix them first before I know whether that'll be a waste of time!

view this post on Zulip Luke Boswell (Mar 12 2025 at 06:46):

After some digging, I found the Newline tokens don't have region information set. I think this might be the cause of challenges getting the correct offsets for AST nodes.

From what I've discovered so far, I think it's because we are using the offset field to store the indentation. I'm wondering if we could use the extra field for this instead.

pub fn pushNewline(self: *TokenizedBuffer, indent: u32) void {
    self.tokens.append(self.env.gpa, .{
        .tag = .Newline,
        .offset = indent, // store the indent in the offset field
        .extra = .{ .length = 0 },
    }) catch |err| exitOnOom(err);
}

view this post on Zulip Luke Boswell (Mar 12 2025 at 06:47):

(deleted) -- i think that example is just going to cause more confusion

view this post on Zulip Brendan Hansknecht (Mar 12 2025 at 06:50):

Yeah, instead of having an offset, newlines store the indentation of the following line

view this post on Zulip Brendan Hansknecht (Mar 12 2025 at 06:50):

So I think the parser needs to make sure not to use locations from newlines

view this post on Zulip Brendan Hansknecht (Mar 12 2025 at 06:50):

It needs to take the token before or after the newline depending on its goal

view this post on Zulip Brendan Hansknecht (Mar 12 2025 at 06:51):

But yeah, probably could move indents to the extra field instead

view this post on Zulip Brendan Hansknecht (Mar 12 2025 at 06:52):

That sounds cleaner. Especially since we have the space anyway for it.

view this post on Zulip Luke Boswell (Mar 12 2025 at 06:54):

Yeah, I'm just starting with module [] and trying to figure out why the region for that isn't what I expect.

I think the issue is that self.expect(.CloseSquare) catch { ...} calls advance which moves self.pos forward n steps (we don't know how many) until the next non-newline token. So after parsing the module header, we set the region to be the start of the next non-newline token, instead of the end of the ].

view this post on Zulip Luke Boswell (Mar 12 2025 at 06:55):

I have an idea which might fix it

view this post on Zulip Luke Boswell (Mar 12 2025 at 07:12):

Actually I found the fix, change _ = self.parseCollectionSpan to const end = self.parseCollectionSpan, and use that for the region. :octopus:

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 00:53):

@Joshua Warner how do you think we should handle something like 0u8.0 in the tokenizer?

Currently, we convert it into 111.1 which leads to it retokenizing as a float which fails. I'm debating using the extra field of the token to store the suffix. Would be an enum of none, u8, i8, etc. We could actually still store the length as well assuming we limit int literals to less than 2^28 characters.

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 00:54):

exact repro for context: zig build repro-tokenize -- -b MHU4LjA= -v

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 01:20):

Also, is .7 a valid float? I assume we should just make it a float and then reformat it. Though I could also see it mapping to a recoverable malformed node for missing the leading zero.

view this post on Zulip Luke Boswell (Mar 13 2025 at 01:48):

I think .7 should not be valid

view this post on Zulip Luke Boswell (Mar 13 2025 at 01:48):

I think .7 should not be valid

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 01:51):

I guess the question is how is it invalid. As in, if we see .7, I am 99% sure that we want the formatter to output 0.7. so simply treating it as a valid float in the compiler without any sort of malformed node may be the best bet? I guess we should still push a message. That way we can raise a warning?

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 01:51):

I guess I'm not fully sure when we want a malformed node vs a message vs etc

view this post on Zulip Richard Feldman (Mar 13 2025 at 01:59):

today .7 is a tuple accessor function, but I believe we discussed how that syntax needs to change to _.7

view this post on Zulip Richard Feldman (Mar 13 2025 at 01:59):

so if it's not a tuple accessor, it seems fine to have it be a number literal

view this post on Zulip Richard Feldman (Mar 13 2025 at 01:59):

but yeah I'd have the formatter add the 0

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 02:05):

totally forgot about tuple accessors

view this post on Zulip Joshua Warner (Mar 13 2025 at 04:02):

I would expect 0u8.0 to be: 0u8 (an int literal), then a NoSpaceDotInt token

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 05:06):

Yeah, so the issue here is technically reprinting for the fuzzer. Gets turned into 111.1.

view this post on Zulip Brendan Hansknecht (Mar 13 2025 at 05:37):

hmm, yeah, I think we need smarter number reprinting. Another example failure that is really the reprinting:
0o0.0 becomes 111.1. First parses as an int and then a NoSpaceDotInt. The second parses as a float.

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:37):

How do people feel about adding a FileStart token that is always the first, in the zero'th position?

For something like a .missing_header diagnostic, it would make more sense to attach that to the whole file than to a single token. Maybe there are other errors that apply to the whole file?

Or should we special case the error in this case and if something comes up in future that we want to apply for the whole file, we revisit?

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:39):

Here is a fuzz crash I was looking at. To get formatting stable, we will need to format something for these malformed nodes.

The question is what.

My first thought was just write out the original bytes.

~~~SOURCE
4o|
~~~PROBLEMS
PARSER: missing_header
PARSER: unexpected_token
~~~TOKENS
MalformedNumberBadSuffix(1:1-1:3),OpBar(1:3-1:4),EndOfFile(1:4-1:4),
~~~PARSE
(file (1:1-1:4)
    (malformed_header (1:1-1:3) "missing_header")
    (malformed_expr (1:3-1:4) "unexpected_token"))

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:39):

I would start by just not formatting on failures like this

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:40):

You have problems, so don't format at all. Just stop after parse

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:40):

This is what I'm looking at... is this just a repro-parse specific thing then?

$ zig build repro-parse -- /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
Reading bytes for repro from /private/tmp/corpus/default/crashes.2025-03-14-16:15:46/id:000003,sig:06,src:000007,time:57864030,execs:42508956,op:havoc,rep:4
slices differ. first difference occurs at index 0 (0x0)

============ expected this output: =============  len: 2 (0x2)

[0]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.missing_header, .region = check.parse.IR.Region{ .start = 0, .end = 0 } }
[1]: check.parse.IR.Diagnostic{ .tag = check.parse.IR.Diagnostic.Tag.unexpected_token, .region = check.parse.IR.Region{ .start = 1, .end = 2 } }

============= instead found this: ==============  len: 0 (0x0)


================================================

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:40):

Long term, we do want to support formatting even with some problems, but that is likely to be a larger project (I guess you totally could take it on now if you want).

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:41):

Interesting. I assumed we were already doing that because it seems the fuzzer is checking things even with errors.

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:42):

It's not supported to

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:43):

Root error starts here:

    std.testing.expectEqualSlices(IR.Diagnostic, parse_ast.errors, &[_]IR.Diagnostic{}) catch {
        return error.ParseFailed;
    };

Should bubble up through:

    const formatted = try parseAndFmt(gpa, input, debug);

Then should be ignored here:

            error.ParseFailed => {
                // No issue. Just bad input we couldn't parse.
                return;
            },

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:44):

Ah I see. We're not pushing the errors into ModuleEnv yet.

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:45):

Oh, I don't think this checks module env errors yet, is that the issue?

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:45):

Or maybe the opposite, we are pushing into the ModuleEnv but in the fuzzer it's still checking in the IR

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:47):

This test case 4o| is passing the fuzzer

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:47):

It just prints out a parse error

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:48):

Due to first pars failing, it exits early (and considers it a success)

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:48):

zig build repro-parse -- -b NG98Cg== -v && echo "all passed"

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:49):

I guess that debug print had me confused. Thanks for explaining that

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:49):

$ echo $?
0

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:49):

Maybe in verbose mode we should print out a message to clarify that on first parse failure we just exit.

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 05:50):

That or maybe add a flag, that just uses some sort of mem eql instead of the testing version that prints on error

view this post on Zulip Luke Boswell (Mar 14 2025 at 05:50):

Yeah, someone else is going to get tripped up on this too I'm sure

view this post on Zulip Luke Boswell (Mar 14 2025 at 06:13):

How should we handle malformed tokens? here is an example.
Screenshot 2025-03-14 at 17.08.52.png

That currently gives us these tokens -- there isn't anything for the NUL byte

~~~TOKENS
KwModule(1:1-1:7),OpenSquare(1:8-1:9),CloseSquare(1:9-1:10),Newline(1:1-1:1),
LowerIdent(2:1-2:4),EndOfFile(2:5-2:5),
~~~PARSE
(file (1:1-2:5)
    (module (1:1-1:10))
    (ident (2:1-2:4) "" "foo"))

view this post on Zulip Luke Boswell (Mar 14 2025 at 06:13):

Oh wait.. I think I see other MalformedXXX tokens :man_facepalming:

view this post on Zulip Brendan Hansknecht (Mar 14 2025 at 06:18):

Might just be .MalformedUnknownToken

view this post on Zulip Brendan Hansknecht (Mar 18 2025 at 16:59):

Random comment from Loris that I think is a good idea when we consider making more intensely optimized parsers via simd or only using start offset or similar:

Keep around the naive path. It makes a great fuzz oracle.

That might be a long time in the future, but something to consider for more invasive changes.

view this post on Zulip Joshua Warner (Mar 18 2025 at 18:00):

Yes!

view this post on Zulip Brendan Hansknecht (Mar 19 2025 at 04:51):

Also good read on optimizing a tokenizer in zig: https://iridescence.me/2025/03/14/elyra-part-1-optimal-tokenization/


Last updated: Jul 06 2025 at 12:14 UTC