Stream: compiler development

Topic: zig compiler - IR serde


view this post on Zulip Norbert Hajagos (Feb 16 2025 at 09:08):

Hi! I'm starting on function specialization. I'd like to create a test that has the correct input data in it. Since I'm at the veeery beginning of learning, can someone give me an example how I could build up an IR in zig manually (with module Env all that)? Just for example this code:

increment = |num| num + 1

main = || {
    x = 2
    increment(x)
}

I will be available after the usual Sunday family dinner :)

view this post on Zulip Sam Mohr (Feb 16 2025 at 09:25):

The best approach seems to be creating an S-expr parser and printer for each IR

view this post on Zulip Sam Mohr (Feb 16 2025 at 09:29):

We could manually create nodes, but that's probably more trouble than it's worth

view this post on Zulip Sam Mohr (Feb 16 2025 at 09:33):

If you really want to go that way, let me know and I can make a branch with an example

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 13:39):

I agree a serializable format would be better. I guess there's no need for me to rush straight to tackling specialization, if there are things that would make it easier.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:10):

Any chance we can share one printer/parser?

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:11):

Are the IRs similar enough?

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 16:13):

From a quick look, yes, they seem pretty similar. At least from what I have seen, in function lift and function solve IR. But after reading Ayas' document, they should be the same as they are just more and more simplified versions of one-another.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:17):

I more meant in datastructure form such that a tiny bit of zig comptime could deal with any of them for printing and parsing. And I know they all should have similar datastrtutures. Just not sure if they are close enough in practice.

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 16:23):

You mean creating a general IRPrinter that can take any of the IR's and spit out the correct strings? (Same for the parse)

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:23):

Yeah

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:25):

Given we have so many IRs and they are all laid out very similar, sounds doable

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 16:29):

I see. Would be cool and should strive for it I think! I just started to work on a printer. I'm still in the Zig phase of "What.. a file is a struct?", so I don't know when that will be usable. Right now my plan is to create 'serialize' functions on all the base.zig types. They would take a std.io.AnyWriter as argument. Once I can print those, I can widen my horizon.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:48):

Sure

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 16:48):

Exploration is good right now in general

view this post on Zulip Joshua Warner (Feb 16 2025 at 17:45):

My thinking was to create a generic sexpr parser, and have each IR be constructable from that sexpr representation. Similarly for printing - irs only need to know how to output sexpr nodes.

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 17:55):

Do I understand it correctly that Sexpr would be an Intermediate representation for all the intermadiate representations during serialization / deserialization :big_smile: ? So we would go from {check_IR, lift_IR, ...}->Sexpr->String and similarly backwards

view this post on Zulip Notification Bot (Feb 16 2025 at 17:58):

This topic was moved here from #compiler development > help creating a basic IR by Norbert Hajagos.

view this post on Zulip Joshua Warner (Feb 16 2025 at 18:01):

Yep, that was my thought

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 18:01):

I like that

view this post on Zulip Joshua Warner (Feb 16 2025 at 18:01):

That'll let us have one parser and one pretty-printer

view this post on Zulip Isaac Van Doren (Feb 16 2025 at 19:09):

How do you think a common printer and parser would be implemented in Zig? I don’t have a clear idea how that would work

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 19:13):

Sounds like they were thinking just making a generic intermediate IR. Then having everything go to/from that data structure.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 19:15):

Otherwise, you can do interfaces by doing what allocators do with essentially a vtable. Could make one that every ir implements that is used for printing/parsing

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 19:15):

Lastly, by making comptime interfaces, you essentially have traits from rust

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 19:57):

When we are spinning this up, if anyone needs help with zig interfaces and what not, let me know. I think if you design the interface as you see fit ignoring zig, I can help translate to zig.

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 20:03):

I'll be cooking up something then and put progress updates in this thread.

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 20:25):

I already have a zig question :big_smile:
The type of base.Literal.String is of collections.interner.StringLiteral
I wanted to experiment with printing each kind of literal, but the StringLiteral type doesn't seem to hold any value. It's only a namespace for the Idx and the Interner type. I undestand that with the combination of an interner and an index, i can query the string value, but does the current representation of base.Literal.String doesn't have either, so it holds 0 information.

Should the base.Literal.String be of type collections.interner.StringLiteral.Idx, or am I missing something?

view this post on Zulip Luke Boswell (Feb 16 2025 at 20:30):

The IRs in those later build stages are mostly a placeholder. It's possible it just doesn't make sense. Im afk rn but can help poke at it in a couple of hours.

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 20:33):

Thanks, but in a couple of hours, I'll already be sleeping :smile:. Think I'll just change it to Idx, see how things will puzzle out.

view this post on Zulip Sam Mohr (Feb 16 2025 at 20:34):

Yes, it should be Idx

view this post on Zulip Sam Mohr (Feb 16 2025 at 20:35):

The lazy compilation thing has been pretty tricky for starting this project, though I'm sure it'll not be as annoying in a couple weeks

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 20:37):

It feels so good to slowly get a hold on what's happening!

You mean the modules not being connected and thus not checked?

view this post on Zulip Sam Mohr (Feb 16 2025 at 20:40):

Yep

view this post on Zulip Norbert Hajagos (Feb 16 2025 at 22:41):

As I was exploring, I stumbled upon a memory leak in the string interner. Fixed it on my branch (testing allocator is so good, just running zig build test does the job).
But... it just feels wrong to loop through strings and free them one-by-one. After all these times of hearing about them, I'm pretty sure this is where I'd need an arena. I tried to wrap the interner's allocator in one, but it kept leaking anyways, so I'm asking if any one of you could guide me here.

Faild attempt at using arena:

Was a fun weekend on the compiler, tomorrow I'll catch up. Bye!

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:48):

yeah I think we need arenas for everything, including string interning, so that we can serialize them :big_smile:

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:48):

I guess except like...scratch allocations that aren't going to end up being serialized maybe

view this post on Zulip Sam Mohr (Feb 16 2025 at 22:48):

Yep

view this post on Zulip Sam Mohr (Feb 16 2025 at 22:49):

I think having some naming pattern around scratch_arena vs. just arena could help there

view this post on Zulip Sam Mohr (Feb 16 2025 at 22:49):

Since Zig seems to lean into the gpa meaning "use a general-purpose allocator here" naming convention

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:50):

yeah I'm not sure if we want gpa or not for those

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:50):

@Andrew Kelley pointed out that bump arenas are not conducive to arraylists that need to resize because they're super unlikely to be able to resize in-place

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:50):

whereas I guess with a gpa it's more likely

view this post on Zulip Richard Feldman (Feb 16 2025 at 22:51):

I'm not sure how to think about the tradeoffs there :sweat_smile:

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:38):

We want gpa or c allocator for all the arraylists

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:39):

We want arenas for basically everything else

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:39):

For arraylists, the amortization of growing by 1.5 to 2x and reasonable capacity defaults will make it cheap

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:40):

Likely arenas we just have to be careful minorly of lifetimes.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:42):

Yeah, and I think we should have a pattern of all allocators being named arena or gpa. I guess we could get more fine grain, but that is likely enough by context of the datastructure. Like the parsing arena won't out live the parsing IR.

view this post on Zulip Richard Feldman (Feb 16 2025 at 23:44):

well everything that's going to be serialized needs to use the arena

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:46):

There will be one arena for that which CanIR will own, and it'll get provided to each module's ModuleEnv as their allocator. We should call the allocator for ModuleEnv "arena" to make that more obvious

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:47):

And by CanIR owning the arena, then the serialization work just falls out of that for free

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:48):

Richard Feldman said:

well everything that's going to be serialized needs to use the arena

:thinking:??? Why?

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:48):

So parse will intern data into the CanIR arena but put its IR into a different allocator's memory, canonicalize will put its intern data (e.g. generated idents created during desugaring) into the arena

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:49):

If everything is in a single arena, then we get (de)serialization for free

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:49):

They just need to use indices instead or pointers

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:50):

So if there's a single data structure, CanIR, in the arena, which includes the ModuleEnv, then we can just treat that arena as binary data that goes into/out of the cache

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:51):

The ir needs to be in a gpa or you will hit big pains on growth. On top of that, if your arena has tons of gaps, you don't want to serialize it.

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:52):

IR for everything besides canonicalization would be in a GPA

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:52):

I think it is better to think of serialization as writing out a struct containing each of the core arrays of data. Each array only uses indices and not pointers and thus can just be copied to disk.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:52):

I would not consider serializing arenas. They will have tons of gaps and garbage data from intermediates.

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:53):

Also, references to data in an arena are still pointers (thus not safe to just serialize)

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:53):

There should be no references in the arena, I agree

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:54):

All intermediates besides arraylist's backing arrays should be created within other allocators

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:54):

So a perf question:

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:55):

Which would be faster: creating an mmaped arena that we don't have to serialize, or using a gpa to get a more compact memory backing for the serialized data and then manually copying to/from the cache?

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:55):

Because I don't think we can just mmap a file in the second case

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:56):

But if it's similar perf, then maybe the arena-based approach isn't worth the effort

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:56):

(deleted)

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:56):

Wrong button clicked...

view this post on Zulip Sam Mohr (Feb 16 2025 at 23:56):

Since it would create a larger cache artifact to use the arena, as you say

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:56):

Meant to edit not delete

view this post on Zulip Brendan Hansknecht (Feb 16 2025 at 23:59):

How I see it working.

  1. Allocate with gpa for growing things and arenas for everything else. Everything should use indices not pointers.
  2. Writing out to cache will require moving data to disk and writing everything out flat. Should just be a couple of really big mem copies and quite cheap
  3. When reloading, growth will not be needed, so can just mmap and then get slices to the correct data for each array.

view this post on Zulip Sam Mohr (Feb 17 2025 at 00:00):

I for sure agree if this approach is about as performant as Richard's idea

view this post on Zulip Sam Mohr (Feb 17 2025 at 00:00):

I've been presuming Richard's way must be faster for us to consider it, but that may very well be wrong

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:02):

so briefly, the whole goal of the serialization is that reading a cache into memory is 2 syscalls: open the file, and then read it into memory. That's it, there's no parsing step whatsoever - once it's in memory, it is already ready to use normally

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:03):

Yes

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:04):

I'm not sure how that works outside of an arena :sweat_smile:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:05):

Well, you may want to do some minor range processing to get zig slices instead of raw offsets for some things, but that is derived from the raw mmap

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:06):

the thing is, in bigger projects (where compile times are noticeable) reading from cache is going to be overwhelmingly how things get loaded into memory

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:06):

I think you think arenas solve an issue they don't

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:07):

An arena is an linked list of chunks of memory allocated from another allocator

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:07):

Chunks are large and cheap to allocate.

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:07):

oh 100% this doesn't work if the arena is backed by linked list haha

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:07):

has to be big virtual allocations

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:07):

that reallocate and copy if necessary (like an arraylist does), which is fine if everything is indices

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:08):

I guess what I'm trying to say is separate initial construction from cache form.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:08):

Initial construction will be fastest and easiest to do with large gpa backed arraylists

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:09):

After done, you have a set of slices all using indicies and no pointers.

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:09):

a critical part of the "2 syscalls" idea is that the very beginning of the allocation (as in, index zero) has to be the header which holds all the other arraylists (and then from there everything can be indices into those)

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:09):

hm, but I guess writev can guarantee that actually :thinking:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:10):

Yes, when serializing, you do so flat (header of slice info) then all those slices constructed above with the gpa.

You serialize flat into a static sized file. This can be loaded with the two syscall approach

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:10):

ok separate question though: if we're running this persistently, and we've loaded something from cache, we explicitly don't want to dealloc everything individually

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:10):

whereas we must do that if it's long-running (e.g. for language server)

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:10):

I guess we can just have a flag of like

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:11):

"this was loaded from cache, so dealloc it all at once"

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:11):

You can dealloc at the module level. Each module will have its own mmap or allocators

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:11):

yeah that seems reasonable!

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:12):

separately, I think we'll want to do some giant virtual alloc up front and then read all the cache entries into that

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:13):

well, that's in a batch build I guess

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:13):

prob not language server bc we might need to dealloc those

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:21):

I don't think it will make a difference

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:23):

MMaps are lazily loaded and page fault to load data. Force loading them into one big allocation would force being everything to be copied an extra time.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:23):

It might be beneficial to load everything into CPU cache early, but if so, we can just force non-lazy loading of the mmaps

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:25):

And due to everything being handled at the page level anyway, doesn't really matter if we have "scattered" mmap pages or "linear" mmap pages. Both are totally up to the os virtual to physical page map to where they will actually be in memory

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 00:31):

As a note, the string intern probably needs to be two gpa backed arraylists. One of starts and lengths for the interned strings. And one flat U8 arraylists of string content. That will make serialization easy.

Aside, does string interning do deduplication if so, it probably also needs an hashmap of string to index it is inserted at. No need to serialize the hashmap cause the string intern will be finalized when serialized to disk.

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:38):

ah interesting, I hadn't been thinking of mmap for these

view this post on Zulip Richard Feldman (Feb 17 2025 at 00:38):

that would change things :thinking:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:06):

Oh, I thought the point of making a flat serializable datastructure was to enable mmaping them

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:06):

Oh, you mean the string interner specifically?

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:07):

We would either need to serialize the interner or would need to reference the original source to reload the strings

view this post on Zulip Sam Mohr (Feb 17 2025 at 01:08):

I definitely figured we'd be serializing the interner, which would "own" the strings within the arena instead of referencing the source file or something

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:09):

Yeah, I agree. Makes sense to do so when serializing can ir.

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:11):

Brendan Hansknecht said:

  1. Writing out to cache will require moving data to disk and writing everything out flat. Should just be a couple of really big mem copies and quite cheap

btw this can be even cheaper than that! writev can let you skip the memcpys, and there's an equivalent on windows

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:12):

Brendan Hansknecht said:

Oh, I thought the point of making a flat serializable datastructure was to enable mmaping them

I was just thinking of doing one big read on the whole thing

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:12):

Yeah, that sounds fine too (can test both and see which is faster, minor detail on how the data is loaded)

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:13):

Richard Feldman said:

Brendan Hansknecht said:

  1. Writing out to cache will require moving data to disk and writing everything out flat. Should just be a couple of really big mem copies and quite cheap

btw this can be even cheaper than that! writev can let you skip the memcpys, and there's an equivalent on windows

Yeah, I said memcpy, but I really meant a copy to disk. Which would be writeev or whatever system call.

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:13):

mmap on anything not in a tempdir always makes me nervous that some other process is going to delete the file while the mmap is active and cause chaos

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:13):

although wait, didn't you say there's a setting that can prevent that? :thinking:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:14):

Someone deletes the roc cache while we are compiling

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:14):

yeah

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:15):

Yeah, if so, reading is fine. Then less chance for failure at the cost of eagerly loading everything (which could actually be a perf gain, really depends on usage pattern). Could be worse for memory use, but maybe not in any sort of meaningful way.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:16):

But yeah, read or mmap or etc all sound fine and we can figure out the tradeoffs later.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:19):

Sounds like mmap counts as an extra reference count on a file and it won't actually be deleted until the mmaps is closed (on Linux)

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:19):

https://stackoverflow.com/questions/42961339/mmap-after-deleting-the-file#50376124

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:20):

So deleting it will remove it from the file tree, but the lnode wont be deleted until the file is until it is safe

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:20):

Would definitely want to test that behaviour to make sure or look it up in docs somewhere

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:22):

Oh, though mutation on an mmaped file still may affect things, but if a user is mutating cache files, that is their own problem

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:23):

yeah I agree

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:23):

mutating cache files is UB anyway

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:23):

regardless of when you do it

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:24):

(in this case)

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:24):

but deleting cache files should always be safe

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:24):

so that sounds like a great option to me!

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:25):

in that design the 2 syscalls are open and mmap, and then afterwards we can munmap if we're a language server and want to selectively clean up

view this post on Zulip Richard Feldman (Feb 17 2025 at 01:25):

this sounds awesome :smiley:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 01:28):

Yep

view this post on Zulip Andrew Kelley (Feb 17 2025 at 03:29):

Brendan Hansknecht said:

Likely arenas we just have to be careful minorly of lifetimes.

ArrayList didn't grow this debug feature yet but some of the data structures such as hash maps have "pointer locks", so if you want to keep element pointers alive for a nontrivial amount of function call graph, you can do something like this:

map.lockPointers();
defer map.unlockPointers();

and it will crash if you called any of the non "assume capacity" methods. i.e. you get a crash instead of potential Use-After-Free

in release fast mode those are no-ops.

view this post on Zulip Andrew Kelley (Feb 17 2025 at 03:31):

Richard Feldman said:

I'm not sure how that works outside of an arena :sweat_smile:

like this

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 03:39):

Thanks

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 03:39):

Yeah, that matches what I expected

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 03:39):

Also, cool debug feature

view this post on Zulip Norbert Hajagos (Feb 17 2025 at 10:15):

Brendan Hansknecht said:

As a note, the string intern probably needs to be two gpa backed arraylists. One of starts and lengths for the interned strings. And one flat U8 arraylists of string content. That will make serialization easy.

Aside, does string interning do deduplication if so, it probably also needs an hashmap of string to index it is inserted at. No need to serialize the hashmap cause the string intern will be finalized when serialized to disk.

This is the comment above the StringLiteral.Interner:

/// As opposed to the [IdentName.Interner], this interner does not
/// deduplicate its values because they are expected to be almost all unique,
/// and also larger, meaning more expensive to do equality checking on.

I also thought of something similar, since the strings won't be changing in the buffer. How do you feel about a design like this?

// just after inicialization
expect interner.buff == ""
expect interner.start_positions == [0]

//inserting 2 strings
const hi_idx = interner.insert("hi"); // 0
const there_idx = interner.insert("there"); // 1

//results with
expect interner.buff == "hithere"
expect interner.start_positions == [0, 2, 7]

// no need to store start and length, since the next value in start_positions holds where the string ends
expect interner.get(there_idx) == interner.buff[interner.start_positions[there_idx]..interner.start_positions[there_idx +1]]

Would be happy to implement it. What type should the start_positions arraylist hold? If only string literals from source code will be in the interner, u32 would be enough. Source files with > 4Gb str literals aren't realistic imo. However, if we stored embedded files in these as well (think import "some-file" as some_str : Str) then I'd use a usize just in case. I'm easily convinced it's an overkill though.

view this post on Zulip Andrew Kelley (Feb 17 2025 at 10:48):

null termination is viable as well. with short strings the cpu cache savings can be worth it

view this post on Zulip Joshua Warner (Feb 17 2025 at 11:18):

Or a previous strategy I’ve seen, of prepending the string with a varint length, encoded backwards, and placed before the offset you return.

view this post on Zulip Norbert Hajagos (Feb 17 2025 at 12:23):

Cool! Could do either. @Joshua Warner do you think storing the length that way would be beneficial over having null terminated strings?

view this post on Zulip Joshua Warner (Feb 17 2025 at 15:40):

The advantage would be in making computing the length of strings >= ~64 bytes (a cache line) faster

view this post on Zulip Joshua Warner (Feb 17 2025 at 15:41):

(that said, I think it's unlikely we'll have many identifiers >= 64 bytes, so :shrug: )

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 15:52):

This is the string literal Interner. We will have some large ones

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 16:02):

Norbert Hajagos said:

Would be happy to implement it. What type should the start_positions arraylist hold? If only string literals from source code will be in the interner, u32 would be enough. Source files with > 4Gb str literals aren't realistic imo. However, if we stored embedded files in these as well (think import "some-file" as some_str : Str) then I'd use a usize just in case. I'm easily convinced it's an overkill though.

I think we should handle imported sources special and just never put them in interners. That would be a waste. And since the interners are per module, the strings from a single module can never be over 4GB.

view this post on Zulip Richard Feldman (Feb 17 2025 at 17:35):

should literals be interned at all? I don't see the benefit honestly :sweat_smile:

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 17:36):

Only benefit is that you can close files....so maybe not worth it.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 17:36):

Might be better to just reference back to the original file

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 17:37):

I guess also maybe minor cache gains later in the compiler

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 17:38):

But yeah, I think the only real gain is if it saves a lot of memory in large roc projects due to being able to free files.

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:14):

Presumably we would want to serialize that data with the can IR (or whatever else we're caching)

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:15):

That way we only have to read the can IR and not the original file (e.g. supposing we're confident it hasn't changed - and assuming we didn't need to hash it to figure that out)

view this post on Zulip Richard Feldman (Feb 17 2025 at 18:17):

we already have to read the original file into memory in order to hash it, because the hash is the key for the cached can IR

view this post on Zulip Richard Feldman (Feb 17 2025 at 18:17):

and the can IR is a hash of the source bytes, so we know all the offsets into the source file will still be valid! :smiley:

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:17):

IMO we ought to skip hashing most of the time

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:18):

That sort of thing can really matter for large projects

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:18):

(this is exactly what git does - it doesn't rehash all files in your git dir every time you commit - only the ones that have a different stat)

view this post on Zulip Joshua Warner (Feb 17 2025 at 18:20):

If we can only read one file (the can ir), and not seek the disk to read the original file again, that's a substantial speed boost

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 19:19):

Note: this is the kind of thing that is super easy to change later

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 19:19):

So either way we go should be fine.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 19:20):

But yeah, especially on a HDD, reading only one could be big gains

view this post on Zulip Richard Feldman (Feb 17 2025 at 19:41):

I'm not sure how we could skip it :thinking:

view this post on Zulip Richard Feldman (Feb 17 2025 at 19:42):

like if the CLI receives an instruction to check main.roc, how would it know what cached IR to load without reading and hashing the source bytes?

view this post on Zulip Richard Feldman (Feb 17 2025 at 19:42):

(it's different if it's a persistent process that's watching the filesystem for changes, of course)

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 20:09):

Oh yeah, the hash of the file is the key to the cache.

view this post on Zulip Brendan Hansknecht (Feb 17 2025 at 20:09):

So hashing is required unless we come up with a different key

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:19):

Joshua Warner said:

Or a previous strategy I’ve seen, of prepending the string with a varint length, encoded backwards, and placed before the offset you return.

This never occured to me. Cool idea!

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:20):

Richard Feldman said:

should literals be interned at all? I don't see the benefit honestly :sweat_smile:

In zig we do it because it helps simplify our memory model, particularly with serialization

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:21):

I must be missing something because I don't see the connection with files

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:24):

if the literal doesn't have any escapes in it, you can just store the region in the original source file
because those are the exact bytes you want anyway

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:24):

Richard Feldman said:

like if the CLI receives an instruction to check main.roc, how would it know what cached IR to load without reading and hashing the source bytes?

Have a cache file per source file path, and store the size, mtime, inode. If any of those change, cache miss. We do this in zig for source -> ZIR

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:25):

Oh, we don't have source files, tokens, or ast nodes loaded at all in the happy path

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:26):

we could do the size/mtime/inode/etc as an extra layer, but if any of those change we still have to redo the hash computation

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:28):

so the tradeoff for the cache hit scenario would be:

in the cache miss scenario, we might discover that we need to re-hash sooner...but then the first thing we do in a cache miss is hash anyway, to know what to name the new cache file :big_smile:

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:28):

so I guess another way to say it is that we re-hash whether there's going to be a cache miss or cache hit, so might as well just always do it, yeah?

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:29):

I think you should be able to avoid rehashing when you have a hit

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:30):

In our case it would be extremely wasteful since we don't even need access to the file contents on a cache hit

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:31):

Maybe I'm not understanding what you mean when you say "hash". Are you talking about the file contents?

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:38):

I vaguely recall something about assigning a unique integer per identifier across all source files or something like that. I'm guessing that's what you mean by "hash"

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:40):

by hash I specifically mean a BLAKE3 hash of the entire contents of the .roc source file

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:41):

:thinking: how do we know we have a hit without rehashing though?

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:54):

(mtime, inode, size) is generally considered to be reliable enough. As a bonus you can try to detect mtime granularity and avoid false positives in case the file system has high granularity (i.e. NFS, which nobody should be using)

view this post on Zulip Andrew Kelley (Feb 18 2025 at 02:55):

fstat cache hits are going to be a lot faster than hashing cache hits. Especially if you don't need to load the file contents on hit

view this post on Zulip Sam Mohr (Feb 18 2025 at 02:57):

If we intern string literals, then I'm pretty sure we wouldn't need anything from the source file, just the cache contents

view this post on Zulip Sam Mohr (Feb 18 2025 at 02:57):

Unless there are compiler errors and we need to show diagnostics based on regions in the source, of course

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:58):

gotcha, yeah that's a reasonable argument for layering a (mtime, inode, size) cache on top of the BLAKE3 hash cache, to avoid reading the files and BLAKE3-ing them unnecessarily in the happy path case :thumbs_up:

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:59):

I don't think we need to do any special design work for that now - can add that on sometime after 0.1.0

view this post on Zulip Richard Feldman (Feb 18 2025 at 02:59):

but it is a good argument for not assuming we have the source bytes in memory

view this post on Zulip Richard Feldman (Feb 18 2025 at 03:00):

so yeah, maybe just interning them is simplest haha

view this post on Zulip Richard Feldman (Feb 18 2025 at 03:00):

(them being literals)

view this post on Zulip Sam Mohr (Feb 18 2025 at 03:00):

It's simpler for 0.1.0 as well

view this post on Zulip Sam Mohr (Feb 18 2025 at 03:00):

But not that much simpler

view this post on Zulip Richard Feldman (Feb 18 2025 at 03:02):

interning literals?

view this post on Zulip Sam Mohr (Feb 18 2025 at 03:10):

Yes

view this post on Zulip Joshua Warner (Feb 18 2025 at 03:42):

We can differentiate between interning (ensuring we have one unique copy of each) and just copying them

view this post on Zulip Joshua Warner (Feb 18 2025 at 03:42):

The latter is what we need to avoid re-reading the source file

view this post on Zulip Joshua Warner (Feb 18 2025 at 03:43):

The former is potentially a small binary size optimization, but this is probably the wrong layer to do that at

view this post on Zulip Joshua Warner (Feb 18 2025 at 03:43):

We need to ensure symbols are unique, but I don't see any particular reason we need to guarantee strings are

view this post on Zulip Sam Mohr (Feb 18 2025 at 03:57):

I'm talking about just copying, not interning, good distinction

view this post on Zulip Brendan Hansknecht (Feb 18 2025 at 04:43):

Yeah, string interner doesn't deduplicate

view this post on Zulip Brendan Hansknecht (Feb 18 2025 at 04:44):

It could, but unnecessary for the most part

view this post on Zulip Luke Boswell (Feb 22 2025 at 05:14):

I've made a draft PR with an initial implementation of Parser AST -> SExpr

https://github.com/roc-lang/roc/pull/7629

It doesn't really do a whole lot right now... but it compiles and runs :smiley:

test "example s-expr" {
    const source =
        \\module []
        \\
        \\foo = "bar"
    ;
    const expected =
        \\(header statement)
    ;

    try testHelper(testing.allocator, source, expected);
}

view this post on Zulip Luke Boswell (Feb 22 2025 at 05:14):

Intention will be to make it generate snapshots somehow... but while I'm developing the IR and wiring it up I'll probably use a unit test.

view this post on Zulip Luke Boswell (Feb 22 2025 at 23:38):

After going down this road a bit further... I'm thinking this may be the wrong approach.

view this post on Zulip Luke Boswell (Feb 22 2025 at 23:38):

The issue is how to work with the nested nodes and our SoA structure.

view this post on Zulip Luke Boswell (Feb 22 2025 at 23:38):

I'm going to try just using the standard zig format

view this post on Zulip Luke Boswell (Feb 22 2025 at 23:40):

That does mean we are providing a std.io.Writer, but I assume there is a way for that to not allocate, like maybe a heap allocated writer or similar.

view this post on Zulip Luke Boswell (Feb 23 2025 at 00:14):

@Joshua Warner @Anthony Bullard

How should I be getting the ident strings back out of the Parser AST?

I'm reasonably sure we just haven't got to this part yet, but wondering where the interned strings would be.

/// Represents a Pattern used in pattern matching.
pub const Pattern = union(enum) {
    ident: struct {
        ident_tok: TokenIdx, // <--- should this be an interned Ident.Idx instead?
        region: Region,
    },
pub const Token = struct {
    tag: Tag,
    offset: u32,
    extra: Extra,

    pub const Extra = union {
        length: u32,
        interned: base.Ident.Idx, <--- or should I be pulling it out from here??
    };

view this post on Zulip Luke Boswell (Feb 23 2025 at 00:15):

Or maybe using the offset and length and just indexing from the source bytes

view this post on Zulip Luke Boswell (Feb 23 2025 at 00:17):

Or maybe I should be using the Region somehow

view this post on Zulip Luke Boswell (Feb 23 2025 at 00:25):

Ok, yeah I'm pretty sure I should be grabbing them from the ModuleEnv using the Ident.Idx. So I'll need to take the env as an argument for the formatting helper.

view this post on Zulip Joshua Warner (Feb 23 2025 at 00:31):

Yeah you can fetch indents from the ModuleEnv

view this post on Zulip Joshua Warner (Feb 23 2025 at 00:32):

Strings and numbers and such you’ll have to fetch from the source for now, but my plan is to also stash those somewhere, possibly on the ModuleEnv

view this post on Zulip Luke Boswell (Feb 23 2025 at 00:46):

Ok got something working. Just building out more of the structure for a minimal SExpr test

view this post on Zulip Sam Mohr (Feb 23 2025 at 01:19):

@Joshua Warner I've currently got the Can IR assuming that we have StringLiteral.Idx for the text of numbers as well as strings: https://github.com/roc-lang/roc/blob/7fc2a08e2811fed7207ab5035f680bbf697d232f/src/check/canonicalize/IR.zig#L88

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 03:33):

Luke Boswell said:

After going down this road a bit further... I'm thinking this may be the wrong approach.

Would be glad to chat about this design space at some point or help with implementation issues. Overall, the perf isn't too important. Though good perf will help with fuzzing

view this post on Zulip Luke Boswell (Feb 23 2025 at 03:43):

It was more a reflection on my SExpr "library" and how I had planned to map between the IR's and that to generate a string.

My new approach is to make a toStr helper for things that can be implemented at each level in the IR (but still has the global SoA context). Probably a terrible description -- but I'll get something minimal working soon and share my PR for comments/review.

view this post on Zulip Luke Boswell (Feb 23 2025 at 03:44):

Basically I'm experimenting with ways to do it as I haven't built something like this before.

view this post on Zulip Luke Boswell (Feb 23 2025 at 04:28):

@Brendan Hansknecht interested to know what you think of this https://github.com/roc-lang/roc/pull/7629

Currently have this test passing.

test "example s-expr" {
    const source =
        \\module [foo, bar]
        \\
        \\foo = "hey"
        \\bar = "yo"
    ;
    const expected =
        \\(parse_ast (header "module" (exposes ("foo" "bar"))) (statements (decl (pattern (ident "foo")) (body ((expr "hey")))) (decl (pattern (ident "bar")) (body ((expr "yo"))))))
    ;

    try testSExprHelper(testing.allocator, source, expected);
}

view this post on Zulip Luke Boswell (Feb 23 2025 at 04:30):

Basic idea is to implement the following for various levels of the IR hierarchy and just call it recursively writing out the AST as it goes.

pub fn toStr(ir: *@This(), env: *base.ModuleEnv, writer: std.io.AnyWriter) !void { ... }

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 05:24):

Ah, but then you have to separately make a parser for every ir? Well, I guess the parser isn't strictly required. Just nice to have

view this post on Zulip Luke Boswell (Feb 23 2025 at 22:04):

@Brendan Hansknecht I was heading in the direction of having a separate SExpr IR -- see https://github.com/roc-lang/roc/blob/c72993c7519188698937bcc359d15e5b032ebe78/src/check/parse/SExpr.zig for an example.

This approach with toStr is my second attempt but taking a different direction.

I'm not sure how important it will be for parsing these IR's. That seems to be a tangental requirement in the nice to have category -- but also feels like a rabbit hole in itself.

I can appreciate the idea of build a list of tokens instead of writing directly to a buffer -- I was thinking this enables us to implement checks to ensure everything is well formed.

I am not opposed to returning to try the SExpr IR idea again, I think I would simplify it a littler further, only have the "node" itself a comptime type variable.

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:06):

Why does anything need comptime? (I think the comptime free IR I shared on the PR should be enough)

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:09):

I'm not sure how important it will be for parsing these IR's. That seems to be a tangental requirement in the nice to have category -- but also feels like a rabbit hole in itself.

I don't think we need to write any of the parsing right now. I think we can just write a basic node type, conversion logic from parse ir to that basic node type, and a printer.

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:09):

I think it could be super minimal

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:10):

Cause the sexpr ir should be trivial to print

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:10):

(and pretty trivial to parse too)

view this post on Zulip Luke Boswell (Feb 23 2025 at 22:11):

Yeah it's easy to do. I did the first version like that (without comptime).

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:11):

All this said, I don't want to block anything, but I don't understand the issue with a super simple sexpr ir as an intermediate.

view this post on Zulip Luke Boswell (Feb 23 2025 at 22:11):

I added the comptime so you could pass in a "parser" function to take the []u8 and give you back the nodes you expect -- or an error if it didn't recognise the node string

view this post on Zulip Luke Boswell (Feb 23 2025 at 22:12):

There's no issue. I'm just exploring the design space a little.

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:12):

Oh, I would just separately parse to sexpr ir, then simply pattern match on sexpr ir to convert to the parse ir.

view this post on Zulip Brendan Hansknecht (Feb 23 2025 at 22:13):

so parsing sexpr ir would accept all node types and names. Only the final conversion would invalidate specific like node types

view this post on Zulip Luke Boswell (Feb 23 2025 at 22:13):

That's probably simpler than where I was going. :smiley:

view this post on Zulip Jared Ramirez (Feb 24 2025 at 03:06):

to use sexprs to build IRs for later stages (anything after solve), then will it need to parse nodes that contain type information as well?

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 03:08):

Likely

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 03:08):

Though we can make type info a child node or side band if necessary (as long as we make a way to print it)

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 03:09):

Or if may be the case that we don't support parsing later IRs and they simply print in a lossy form. Haven't fully thought through this

view this post on Zulip Luke Boswell (Feb 24 2025 at 04:00):

Brendan Hansknecht said:

Or if may be the case that we don't support parsing later IRs and they simply print in a lossy form. Haven't fully thought through this

I've been thinking along these lines as one of the reasons why we may not want to "parse" the IR's.

view this post on Zulip Luke Boswell (Feb 24 2025 at 04:01):

The other thought is that these SExpr IR's aren't very nice to read. I'm starting to think it might be nicer just to make something more human readable that is unique for each IR that just includes the context relevant for that stage.

view this post on Zulip Jared Ramirez (Feb 24 2025 at 04:15):

Luke Boswell said:

The other thought is that these SExpr IR's aren't very nice to read. I'm starting to think it might be nicer just to make something more human readable that is unique for each IR that just includes the context relevant for that stage.

I was thinking a similar thing as I’ve been hand-writing some IRs (while exploring the function solving stage). It seems like sexprs that contain ast _plus_ type info could get in unwieldy.

view this post on Zulip Jared Ramirez (Feb 24 2025 at 04:24):

And each stage has different data, so there’s one main parser, it would need to support a bunch of nodes that are maybe only applicable to a single stage

view this post on Zulip Joshua Warner (Feb 24 2025 at 04:39):

My thinking was to do sexprs with perhaps some tasteful extensions. Keep it simple to print and parse but perhaps slightly less verbose

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:46):

I guess I see a two way split. I think it is fine to either:

  1. Go the way of the old compiler, just have pretty printing, and make it special per IR (give up parsing)
  2. Use s expression and for at least the high level IRs, but maybe all IRs eventually add parsing.

Just depends on our exact goals.

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:47):

Goal one is definitely snapshot tests (which requires printing and some reliability, but doesn't need to be perfect).

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:47):

Goal two is parsing for better fuzzing (probably also requires verification, which may actually not be reasonable).

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:49):

I think we could minorly expand expressions to support types and that would be reasonable (something to worry about later)

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:50):

So I would argue that s expressions leave more open for future improvements

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:51):

That said, the simplicity of bespoke printing and no parsing is also reasonable as long as we try to be consistent with printing.

view this post on Zulip Brendan Hansknecht (Feb 24 2025 at 04:51):

I definitely still lean s expr currently.

view this post on Zulip Luke Boswell (Feb 24 2025 at 08:35):

I've got something simple that is working ok. I'm going to continue with converting it into some small snapshots and trying a simple heuristic for pretty printing.

I'm busy all day tomorrow, but should be able to make some progress later in the week.

view this post on Zulip Luke Boswell (Feb 28 2025 at 05:00):

I've updated my PR and would like to merge what I have to prevent it going stale again.

I haven't quite figured out how to get the string parts back to working yet, but the rest is working well.

I also added an option to give either compact or pretty formatted strings. So now our unit test looks like this.

The next step is to generate actual snapshot files instead of having a unit test. But I'm pretty happy now with the API I think.

@Brendan Hansknecht can you cast your eye over my implementation - particularly the children: std.ArrayListUnmanaged(Node) part... :pray:

test "example s-expr" {
    const source =
        \\module [foo, bar]
        \\
        \\foo = "hey"
        \\bar = "yo"
    ;

    const expected =
        \\(file
        \\    (header
        \\        'foo'
        \\        'bar')
        \\    (decl
        \\        (ident
        \\            'foo')
        \\        (string_part))
        \\    (decl
        \\        (ident
        \\            'bar')
        \\        (string_part)))
    ;

    try testSExprHelper(source, expected);
}

view this post on Zulip Luke Boswell (Mar 01 2025 at 01:34):

Ok, I've had a crack at an initial implementation for a tool to run through all the snapshot files... https://github.com/roc-lang/roc/pull/7644

This PR adds a snapshot tool that iterates through all the snapshot text files in src/snapshots/*.txt and re-generates the AST/IR's for each compiler stage.

$ zig build snapshot
info: processed 2 snapshots in 1ms.
$ zig build snapshot -- --verbose
info: src/snapshots/002.txt
info: src/snapshots/001.txt
info: processed 2 snapshots in 2ms.

Here is an example of a snapshot file.

~~~META
description=Basic example to develop the snapshot methodology
~~~SOURCE
module [foo, bar]
foo = "one"
bar = "two"
~~~PARSE
(file
    (header
        'foo'
        'bar')
    (decl
        (ident
            'foo')
        (string_part))
    (decl
        (ident
            'bar')
        (string_part)))

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:40):

@Isaac Van Doren -- how would you feel about me adding FORMAT as a section in these snapshot tests, instead of having a unit tests that write out to another file?

We could then also run the formatter on each snapshot file in our corpus.

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:41):

If we think that's a good idea, I'm down to give it a try. We can land the format cli PR, I can rebase my snapshot PR and try it out

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:41):

I guess I'm thinking of the "snapshot" tool as like our integration test suite...

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 02:43):

Yeah, sounds great!

Would also be great to make the format section optional. If the section is missing, it will then be required that the source section formats to itself.

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:44):

Well... it's kind of one way.

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:44):

So like, we put un-formatted code in the SOURCE section... it formats it and adds to the FORMATED section.

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:44):

Then you can see in any changes in the git diff

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:45):

Did you have another idea in mind @Brendan Hansknecht ? why would we want it optional?

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:47):

Or maybe if the formatted code is identical we don't bother writing out the section... but I wonder if it's cheaper just to dump the bytes out than compare the strings.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 02:47):

I guess to work with git diff, if only a SOURCE section. The process is read source, format it, and write it back to SOURCE (only source means it is already formatted correctly). This avoids printing the same chunk of code twice in a row if the formatting is already good.

If a FORMATTED section exist, we expect the SOURCE to change when formatted. So we need to write out both sections and they aren't redundant.

So just noise reduction, but ensure that all test case can format

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:48):

Ahk that's a good idea. That way we will know if the formatting changes, but not duplicate things unnecessarily.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 02:48):

Maybe that isn't needed, but I feel like for long examples, it will be nice to not print the source twice

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:50):

FYI @Isaac Van Doren I merged your PR -- I'll adapt your integration tests into the snapshot tool as above.

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:51):

@Brendan Hansknecht -- I assumed we want to build the snapshot tool by default so you can run the integration tests easily

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:51):

I'm happy to make it a flag though. Just that was my assumption

view this post on Zulip Luke Boswell (Mar 01 2025 at 02:52):

I guess it should probably just sit behind zig build test also? or might we want to keep that for unit tests only?

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:04):

Yeah validating formatting in the multi tiered snapshots sounds useful

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:06):

But it won't be testing the same behavior as the test I wrote so I think we should keep it. That test wasn't meant to really check anything about the actual formatting itself, just that roc format correctly reads in the specified file and writes back out the formatted version.

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:08):

Originally that test was e2e (it actually built the roc executable and invoked it). I think it would be good to stand up a separate e2e suite for the CLI which would be a good place for tests like that to live

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:14):

Luke Boswell said:

Brendan Hansknecht -- I assumed we want to build the snapshot tool by default so you can run the integration tests easily

That's probably fine.

Just thinking if someone only wants to build roc, they can just zig build and that can be roc and no other wasted time

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:18):

Any concerns with a separate e2e suite for the CLI? I'm happy to set it up

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:23):

I feel like it would be pretty redundant with all of the snapshot tests. Not bad to add a few of, but it is probably preferable to avoid too many scattered tests

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:25):

Luke Boswell said:

I guess it should probably just sit behind zig build test also? or might we want to keep that for unit tests only?

If we put it behind, zig build test we just need to make it sure it is fast enough. Might hurt the dev loop, in which case, probably should be zig build test -Dsnapshot. I think really the separation is for ensure fast feedback.

I assume eventually we will have enough snapshot tests we will want to separate it out, but probably fine to start unified and only separate out if it gets too slow.

view this post on Zulip Luke Boswell (Mar 01 2025 at 03:34):

I might leave it as zig build snapshot for now... at least for this PR

view this post on Zulip Luke Boswell (Mar 01 2025 at 03:34):

I can imagine we'll iterate on the design a few more times at least

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:47):

I feel like it would be pretty redundant with all of the snapshot tests. Not bad to add a few of, but it is probably preferable to avoid too many scattered tests

My intention would be to focus on focusing on the CLI/orchestration logic that wouldn't be tested by the snapshots at all, so I don't think there would be much overlap there. I wouldn't try to make it exhaustive, just for some confidence that changes to the CLI are correct without having to manually test everything.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:47):

Ah, I thought you meant specifically formatting tests. Like another suite of tests but just for formatting.

view this post on Zulip Brendan Hansknecht (Mar 01 2025 at 03:48):

But like tests that format examples/stdlib (folder vs file vs invalid) or exercise various cli args sound great

view this post on Zulip Isaac Van Doren (Mar 01 2025 at 03:49):

Right, that's what I'm thinking :smiley:

view this post on Zulip Luke Boswell (Mar 01 2025 at 23:25):

Anyone looking for something to help out with, we should set up our snapshot tool to run in CI with the other zig things.

I haven't started looking at this yet. But it's ready to be added now.

view this post on Zulip Luke Boswell (Mar 01 2025 at 23:26):

I think the next thing I'd like to tackle is parsing just an expression (not a full file), and translate some of our syntax snapshot tests across to work on improving coverage there.

Maybe look at how we might handle multi file snapshots... :thinking:

view this post on Zulip Anton (Mar 03 2025 at 09:06):

we should set up our snapshot tool to run in CI

What are the commands to use it?

view this post on Zulip Luke Boswell (Mar 03 2025 at 10:11):

zig build snapshot
zig build snapshot -- --verbose also

view this post on Zulip Anton (Mar 03 2025 at 10:17):

I'll get on that after the roc nix package

view this post on Zulip Isaac Van Doren (Mar 03 2025 at 13:18):

zig build snapshot updates the golden values right? Why do we want to run it in CI?

view this post on Zulip Brendan Hansknecht (Mar 03 2025 at 16:40):

You run it, then a few extra git commands to ensure that none of the snapshot files changed

view this post on Zulip Brendan Hansknecht (Mar 03 2025 at 16:41):

If any change or if new ones pop into existence, it is a failure

view this post on Zulip Isaac Van Doren (Mar 03 2025 at 16:47):

Ah I see. I was imagining that there would be a command that would explicitly check that the snapshots matched the current output without requiring you to manually diff them

view this post on Zulip Joshua Warner (Mar 04 2025 at 15:19):

Small improvement to sexpr pretty printing: https://github.com/roc-lang/roc/pull/7659

view this post on Zulip Luke Boswell (Mar 06 2025 at 08:04):

What approach should we take for snapshots that are just a single expression...

For example in crates/compiler/test_syntax/tests/snapshots/pass/add_var_with_spaces.expr.roc we have a single expression

x + 2

Would we want snapshots this small, i.e. without being a full module?

I was thinking of migrating these across one by one as snapshots and using them to help add parser functionality.

view this post on Zulip Luke Boswell (Mar 06 2025 at 08:15):

Or perhaps we would want something like this instead... so turn them into a full module.

~~~META
description=Add a variable with spaces
~~~SOURCE
module [add2]

add2 = x + 2
~~~PARSE
(file
    (header 'add2')
    (decl
        (ident 'add2')
        (binop
            '+'
            (ident '' 'x')
            (int '2'))))

view this post on Zulip Luke Boswell (Mar 06 2025 at 08:18):

Also -- minor tangent while I'm looking at it... would we prefer (binop '+' <lhs> <rhs>) ... or switch on the token and make it more like (plus <lhs> <rhs>)

view this post on Zulip Sam Mohr (Mar 06 2025 at 08:21):

I'd vote for the former (AKA (binop ...)) since it's more consistent with how the rest of the S-expr stuff seems to work, and it gets desugared immediately in the next stage, so it's minor anyway

view this post on Zulip Anthony Bullard (Mar 06 2025 at 10:31):

Definitely want to be able to parse single syntax nodes (exprs, statements, header, etc)

view this post on Zulip Joshua Warner (Mar 06 2025 at 16:12):

Could just have type = Expr in the META section of the snapshot

view this post on Zulip Luke Boswell (Mar 07 2025 at 03:04):

It's really cool seeing this stuff come alive...

~~~META
description=Various type annotations
~~~SOURCE
module []

foo : U64
bar : Thing(a, b, _)
baz : (a, b, c)
add_one : (U8, U16 -> U32)
main! : List(String) -> Result({}, _)
~~~PARSE
(file
    (header)
    (type_anno
        'foo'
        (tag 'U64'))
    (type_anno
        'bar'
        (tag
            'Thing'
            (ty_var 'a')
            (ty_var 'b')
            (_)))
    (type_anno
        'baz'
        (tuple
            (ty_var 'a')
            (ty_var 'b')
            (ty_var 'c')))
    (type_anno
        'add_one'
        (fn
            (tag 'U32')
            (tag 'U8')
            (tag 'U16')))
    (type_anno
        'main!'
        (fn
            (tag
                'Result'
                (record)
                (_))
            (tag
                'List'
                (tag 'String')))))

view this post on Zulip Luke Boswell (Mar 07 2025 at 03:06):

Some simplifications I'm debating rn (just for the snapshots, not renaming tag unions in the zig source or anything)

There's a lot of design space here. It's nice that this is all one way and so easy to update the snapshots.

view this post on Zulip Anton (Mar 08 2025 at 09:39):

type_anno and ty_var are a lot clearer, I wouldn't change them if we don't experience noticeable problems due to verbosity.

view this post on Zulip Luke Boswell (Mar 08 2025 at 22:35):

I'd just like to say in case anyone is wondering Re the current state of snapshots, because they look somewhat strange.

I've just been dumping things in there to help figure out syntax issues with the parser and seed the fuzzer etc.

The current snapshots are pretty expendable and we will probably blow them away soon or change them a lot when we start having a usable Can etc.

view this post on Zulip Anthony Bullard (Mar 10 2025 at 13:43):

I think it's this output is really nice, but I _personally_ would prefer double quotes for strings. Are we using singles just to avoid having to worry about escaping?

view this post on Zulip Luke Boswell (Mar 11 2025 at 03:36):

Just wanted to show off a little... checkout this snapshot. I'm pretty happy with how this is starting to look. It's really easy to see at a glance where the issues are. You can feed inputs from fuzz crashes and dissect what the compiler is doing pretty quickly -- which has helped me find and fix a number of issues.

I'm loving the formatting for Problems also :smiley:

~~~META
description=fuzz crash
~~~SOURCE
H{o,
    ]

foo =

    "on        (string 'onmo %')))
~~~PROBLEMS
TOKENIZE: (2:4-2:4) AsciiControl:

    ]
   ^
TOKENIZE: (1:6-1:6) Expected the correct closing brace here:

    ]
      ^

TOKENIZE: (6:7-6:36) UnclosedString:

    "on        (string 'onmo %')))
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PARSER: missing_header
~~~TOKENS
UpperIdent,OpenCurly,LowerIdent,Comma,Newline,CloseCurly,Newline,LowerIdent,OpAssign,Newline,StringStart,StringPart,EndOfFile
~~~PARSE
(file
    (malformed_header 'missing_header')
    (record (field 'o'))
    (decl
        (ident 'foo')
        (string 'on        (string 'onmo %')))')))
~~~FORMATTED
{ o }

foo = "on        (string 'onmo %')))"
~~~END

view this post on Zulip Luke Boswell (Mar 11 2025 at 03:52):

Anthony Bullard said:

I think it's this output is really nice, but I _personally_ would prefer double quotes for strings. Are we using singles just to avoid having to worry about escaping?

I changed it to double quotes in https://github.com/roc-lang/roc/pull/7672

view this post on Zulip Anthony Bullard (May 13 2025 at 22:41):

Hey @Luke Boswell I'm dealing with an issue with Zig where TLDR we want to move to using TAB as the canonical indentation character in the formatter but Zig multiline string literals do not support using TAB in them. So long story short I need to move all of the current fmt tests to snapshots (I don't have to per se, but I think it's the logical move).

In order to do that I am moving to make snapshots support different IR nodes like the old test_syntax snapshots did. Are you aligned with this? It means that I need to expose some extra methods on the Parse IR so that we can output the SExpr and format correctly.

view this post on Zulip Luke Boswell (May 14 2025 at 07:11):

I'm not sure what you mean by

make snapshots support different IR nodes like the old test_syntax snapshots did

view this post on Zulip Luke Boswell (May 14 2025 at 07:12):

In general though, I think moving the fmt tests to snapshots is a great idea!

view this post on Zulip Luke Boswell (May 14 2025 at 07:13):

That was my intention too, but I figured you wanted the kitchen sink in there because it was easier to work with while developing the Parser initial implementation.

view this post on Zulip Anthony Bullard (May 14 2025 at 10:15):

Luke Boswell said:

I'm not sure what you mean by

make snapshots support different IR nodes like the old test_syntax snapshots did

sorry i mean have it support snapshots of just exprs, statements, or headers instead of just modules

view this post on Zulip Luke Boswell (May 14 2025 at 10:48):

yeah, that is something we need to support anyway (imo)


Last updated: Jul 06 2025 at 12:14 UTC