Hi! I'm starting on function specialization. I'd like to create a test that has the correct input data in it. Since I'm at the veeery beginning of learning, can someone give me an example of how I could build up an IR in Zig manually (with ModuleEnv and all that)? Just for example this code:
increment = |num| num + 1
main = || {
    x = 2
    increment(x)
}
I will be available after the usual Sunday family dinner :)
The best approach seems to be creating an S-expr parser and printer for each IR
We could manually create nodes, but that's probably more trouble than it's worth
If you really want to go that way, let me know and I can make a branch with an example
I agree a serializable format would be better. I guess there's no need for me to rush straight to tackling specialization, if there are things that would make it easier.
Any chance we can share one printer/parser?
Are the IRs similar enough?
From a quick look, yes, they seem pretty similar, at least from what I have seen in the function lift and function solve IRs. But after reading Ayaz's document, they should be the same, as they are just more and more simplified versions of one another.
I meant more in data-structure form, such that a tiny bit of Zig comptime could deal with any of them for printing and parsing. And I know they all should have similar data structures. Just not sure if they are close enough in practice.
You mean creating a general IRPrinter that can take any of the IRs and spit out the correct strings? (Same for the parser)
Yeah
Given we have so many IRs and they are all laid out very similarly, sounds doable
I see. Would be cool, and I think we should strive for it! I just started to work on a printer. I'm still in the Zig phase of "What... a file is a struct?", so I don't know when that will be usable. Right now my plan is to create 'serialize' functions on all the base.zig types. They would take a std.io.AnyWriter as an argument. Once I can print those, I can widen my horizon.
Sure
Exploration is good right now in general
My thinking was to create a generic sexpr parser, and have each IR be constructable from that sexpr representation. Similarly for printing - IRs only need to know how to output sexpr nodes.
Do I understand it correctly that the Sexpr would be an intermediate representation for all the intermediate representations during serialization/deserialization :big_smile:? So we would go from {check_IR, lift_IR, ...} -> Sexpr -> String, and similarly backwards
Yep, that was my thought
I like that
That'll let us have one parser and one pretty-printer
How do you think a common printer and parser would be implemented in Zig? I don’t have a clear idea how that would work
Sounds like they were thinking just making a generic intermediate IR. Then having everything go to/from that data structure.
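For illustration, a minimal sketch of what that generic intermediate could look like (hypothetical names, not the actual implementation):

const std = @import("std");

/// One shared node type: every IR converts to this, and a single
/// printer (and later a single parser) handles all of them.
pub const SExpr = union(enum) {
    atom: []const u8,
    list: []const SExpr,

    pub fn write(self: SExpr, writer: anytype) anyerror!void {
        switch (self) {
            .atom => |text| try writer.writeAll(text),
            .list => |children| {
                try writer.writeByte('(');
                for (children, 0..) |child, i| {
                    if (i > 0) try writer.writeByte(' ');
                    try child.write(writer);
                }
                try writer.writeByte(')');
            },
        }
    }
};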
Otherwise, you can do interfaces by doing what allocators do with essentially a vtable. Could make one that every IR implements that is used for printing/parsing
Lastly, by making comptime interfaces, you essentially have traits from rust
When we are spinning this up, if anyone needs help with zig interfaces and what not, let me know. I think if you design the interface as you see fit ignoring zig, I can help translate to zig.
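For concreteness, a sketch of the allocator-style vtable idea (all names here are hypothetical):

const std = @import("std");

/// Type-erased interface in the spirit of std.mem.Allocator:
/// a pointer to the concrete IR plus a function pointer.
pub const AnyIr = struct {
    ptr: *anyopaque,
    writeSExprFn: *const fn (ptr: *anyopaque, writer: std.io.AnyWriter) anyerror!void,

    pub fn writeSExpr(self: AnyIr, writer: std.io.AnyWriter) anyerror!void {
        return self.writeSExprFn(self.ptr, writer);
    }
};

Each IR would then expose a method returning an AnyIr that points at itself, and the shared printer/parser only ever sees AnyIr.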
I'll be cooking up something then and put progress updates in this thread.
I already have a zig question :big_smile:
The type of base.Literal.String is collections.interner.StringLiteral
I wanted to experiment with printing each kind of literal, but the StringLiteral type doesn't seem to hold any value. It's only a namespace for the Idx and the Interner type. I understand that with the combination of an interner and an index I can query the string value, but the current representation of base.Literal.String doesn't hold either, so it carries no information.
Should the base.Literal.String be of type collections.interner.StringLiteral.Idx, or am I missing something?
The IRs in those later build stages are mostly a placeholder. It's possible it just doesn't make sense. I'm afk rn but can help poke at it in a couple of hours.
Thanks, but in a couple of hours, I'll already be sleeping :smile:. Think I'll just change it to Idx, see how things will puzzle out.
Yes, it should be Idx
The lazy compilation thing has been pretty tricky for starting this project, though I'm sure it'll not be as annoying in a couple weeks
It feels so good to slowly get a hold on what's happening!
You mean the modules not being connected and thus not checked?
Yep
As I was exploring, I stumbled upon a memory leak in the string interner. Fixed it on my branch (the testing allocator is so good, just running zig build test does the job).
But... it just feels wrong to loop through strings and free them one-by-one. After all these times of hearing about them, I'm pretty sure this is where I'd need an arena. I tried to wrap the interner's allocator in one, but it kept leaking anyway, so I'm asking if anyone could guide me here.
Failed attempt at using an arena:
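For reference, the standard pattern that this is aiming for, as a minimal sketch (one deinit instead of per-string frees):

const std = @import("std");

test "arena frees all interned strings at once" {
    var arena = std.heap.ArenaAllocator.init(std.testing.allocator);
    // A single deinit releases every allocation made through the
    // arena, so there's no looping over strings to free one-by-one.
    defer arena.deinit();
    const alloc = arena.allocator();

    _ = try alloc.dupe(u8, "hello");
    _ = try alloc.dupe(u8, "world");
}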
Was a fun weekend on the compiler, tomorrow I'll catch up. Bye!
yeah I think we need arenas for everything, including string interning, so that we can serialize them :big_smile:
I guess except like...scratch allocations that aren't going to end up being serialized maybe
Yep
I think having some naming pattern around scratch_arena vs. just arena could help there
Since Zig seems to lean into the gpa
meaning "use a general-purpose allocator here" naming convention
yeah I'm not sure if we want gpa or not for those
@Andrew Kelley pointed out that bump arenas are not conducive to arraylists that need to resize because they're super unlikely to be able to resize in-place
whereas I guess with a gpa it's more likely
I'm not sure how to think about the tradeoffs there :sweat_smile:
We want gpa or c allocator for all the arraylists
We want arenas for basically everything else
For arraylists, the amortization of growing by 1.5 to 2x and reasonable capacity defaults will make it cheap
With arenas we likely just have to be minorly careful about lifetimes.
Yeah, and I think we should have a pattern of all allocators being named arena or gpa. I guess we could get more fine-grained, but that is likely clear enough from the context of the data structure. Like the parsing arena won't outlive the parsing IR.
well everything that's going to be serialized needs to use the arena
There will be one arena for that which CanIR will own, and it'll get provided to each module's ModuleEnv as their allocator. We should call the allocator for ModuleEnv "arena" to make that more obvious.
And by CanIR owning the arena, the serialization work just falls out of that for free
Richard Feldman said:
well everything that's going to be serialized needs to use the arena
:thinking:??? Why?
So parse will intern data into the CanIR arena but put its IR into a different allocator's memory; canonicalize will put its intern data (e.g. generated idents created during desugaring) into the arena
If everything is in a single arena, then we get (de)serialization for free
They just need to use indices instead of pointers
So if there's a single data structure, CanIR, in the arena, which includes the ModuleEnv, then we can just treat that arena as binary data that goes into/out of the cache
The IR needs to be in a gpa or you will hit big pains on growth. On top of that, if your arena has tons of gaps, you don't want to serialize it.
IR for everything besides canonicalization would be in a GPA
I think it is better to think of serialization as writing out a struct containing each of the core arrays of data. Each array only uses indices and not pointers and thus can just be copied to disk.
I would not consider serializing arenas. They will have tons of gaps and garbage data from intermediates.
Also, references to data in an arena are still pointers (thus not safe to just serialize)
There should be no references in the arena, I agree
All intermediates besides arraylist's backing arrays should be created within other allocators
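A sketch of that shape, with made-up field names:

/// Hypothetical on-disk layout: a fixed header of lengths followed by
/// the flat arrays themselves. Cross-references are indices into those
/// arrays, never pointers, so the bytes can be copied to disk as-is.
const CacheHeader = extern struct {
    magic: u32,
    version: u32,
    num_exprs: u32,
    num_idents: u32,
    num_string_bytes: u32,
    // the file then continues with:
    //   [num_exprs]Expr, [num_idents]Ident, [num_string_bytes]u8
};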
So a perf question: which would be faster, creating an mmaped arena that we don't have to serialize, or using a gpa to get a more compact memory backing for the serialized data and then manually copying to/from the cache?
Because I don't think we can just mmap a file in the second case
But if it's similar perf, then maybe the arena-based approach isn't worth the effort
Since it would create a larger cache artifact to use the arena, as you say
How I see it working.
I for sure agree if this approach is about as performant as Richard's idea
I've been presuming Richard's way must be faster for us to consider it, but that may very well be wrong
so briefly, the whole goal of the serialization is that reading a cache into memory is 2 syscalls: open the file, and then read it into memory. That's it, there's no parsing step whatsoever - once it's in memory, it is already ready to use normally
Yes
I'm not sure how that works outside of an arena :sweat_smile:
Well, you may want to do some minor range processing to get zig slices instead of raw offsets for some things, but that is derived from the raw mmap
the thing is, in bigger projects (where compile times are noticeable) reading from cache is going to be overwhelmingly how things get loaded into memory
I think you think arenas solve an issue they don't
An arena is a linked list of chunks of memory allocated from another allocator
Chunks are large and cheap to allocate.
oh 100% this doesn't work if the arena is backed by linked list haha
has to be big virtual allocations
that reallocate and copy if necessary (like an arraylist does), which is fine if everything is indices
I guess what I'm trying to say is separate initial construction from cache form.
Initial construction will be fastest and easiest to do with large gpa backed arraylists
After done, you have a set of slices all using indices and no pointers.
a critical part of the "2 syscalls" idea is that the very beginning of the allocation (as in, index zero) has to be the header which holds all the other arraylists (and then from there everything can be indices into those)
hm, but I guess writev can guarantee that actually :thinking:
Yes, when serializing, you do so flat (header of slice info) then all those slices constructed above with the gpa.
You serialize flat into a static sized file. This can be loaded with the two syscall approach
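Roughly, the read side could look like this (a sketch; note the stat for the size is a third syscall unless the size is stored elsewhere, e.g. in the cache file name):

const std = @import("std");

fn loadCache(gpa: std.mem.Allocator, path: []const u8) ![]align(8) u8 {
    const file = try std.fs.cwd().openFile(path, .{});
    defer file.close();
    const size: usize = @intCast((try file.stat()).size);
    const bytes = try gpa.alignedAlloc(u8, 8, size);
    errdefer gpa.free(bytes);
    _ = try file.readAll(bytes);
    // bytes now starts with the header; everything after it is flat
    // arrays addressed by index, ready to use with no parsing step.
    return bytes;
}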
ok separate question though: if we're running this persistently, and we've loaded something from cache, we explicitly don't want to dealloc everything individually
whereas we must do that if it's long-running (e.g. for language server)
I guess we can just have a flag of like
"this was loaded from cache, so dealloc it all at once"
You can dealloc at the module level. Each module will have its own mmap or allocators
yeah that seems reasonable!
separately, I think we'll want to do some giant virtual alloc up front and then read all the cache entries into that
well, that's in a batch build I guess
prob not language server bc we might need to dealloc those
I don't think it will make a difference
MMaps are lazily loaded and page fault to load data. Force-loading them into one big allocation would force everything to be copied an extra time.
It might be beneficial to load everything into CPU cache early, but if so, we can just force non-lazy loading of the mmaps
And since everything is handled at the page level anyway, it doesn't really matter if we have "scattered" mmap pages or "linear" mmap pages. Both are entirely up to the OS's virtual-to-physical page mapping for where they will actually be in memory.
As a note, the string interner probably needs to be two gpa-backed arraylists: one of starts and lengths for the interned strings, and one flat u8 arraylist of string content. That will make serialization easy.
Aside: does string interning do deduplication? If so, it probably also needs a hashmap from string to the index it was inserted at. No need to serialize the hashmap, because the string interner will be finalized when serialized to disk.
ah interesting, I hadn't been thinking of mmap for these
that would change things :thinking:
Oh, I thought the point of making a flat serializable datastructure was to enable mmaping them
Oh, you mean the string interner specifically?
We would either need to serialize the interner or would need to reference the original source to reload the strings
I definitely figured we'd be serializing the interner, which would "own" the strings within the arena instead of referencing the source file or something
Yeah, I agree. Makes sense to do so when serializing can ir.
Brendan Hansknecht said:
- Writing out to cache will require moving data to disk and writing everything out flat. Should just be a couple of really big mem copies and quite cheap
btw this can be even cheaper than that! writev can let you skip the memcpys, and there's an equivalent on windows
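A sketch of that write path, with Header and Expr as stand-ins for the real flat types:

const std = @import("std");

const Header = extern struct { magic: u32, num_exprs: u32 };
const Expr = extern struct { tag: u32, data: u32 };

/// Write header + payload in one writev call, skipping the
/// intermediate copy into a contiguous buffer.
fn writeCache(file: std.fs.File, header: *const Header, exprs: []const Expr) !void {
    var iovecs = [_]std.posix.iovec_const{
        .{ .base = std.mem.asBytes(header), .len = @sizeOf(Header) },
        .{ .base = @ptrCast(exprs.ptr), .len = exprs.len * @sizeOf(Expr) },
    };
    try file.writevAll(&iovecs);
}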
Brendan Hansknecht said:
Oh, I thought the point of making a flat serializable datastructure was to enable mmaping them
I was just thinking of doing one big read on the whole thing
Yeah, that sounds fine too (can test both and see which is faster, minor detail on how the data is loaded)
Richard Feldman said:
Brendan Hansknecht said:
- Writing out to cache will require moving data to disk and writing everything out flat. Should just be a couple of really big mem copies and quite cheap
btw this can be even cheaper than that! writev can let you skip the memcpys, and there's an equivalent on windows
Yeah, I said memcpy, but I really meant a copy to disk. Which would be writev or whatever system call.
mmap on anything not in a tempdir always makes me nervous that some other process is going to delete the file while the mmap is active and cause chaos
although wait, didn't you say there's a setting that can prevent that? :thinking:
Someone deletes the roc cache while we are compiling
yeah
Yeah, if so, reading is fine. Then less chance for failure at the cost of eagerly loading everything (which could actually be a perf gain, really depends on usage pattern). Could be worse for memory use, but maybe not in any sort of meaningful way.
But yeah, read or mmap or etc all sound fine and we can figure out the tradeoffs later.
Sounds like mmap counts as an extra reference count on a file and it won't actually be deleted until the mmap is closed (on Linux)
https://stackoverflow.com/questions/42961339/mmap-after-deleting-the-file#50376124
So deleting it will remove it from the file tree, but the inode won't be deleted until the mmap is closed, so it is safe
Would definitely want to test that behaviour to make sure or look it up in docs somewhere
Oh, though mutation on an mmaped file still may affect things, but if a user is mutating cache files, that is their own problem
yeah I agree
mutating cache files is UB anyway
regardless of when you do it
(in this case)
but deleting cache files should always be safe
so that sounds like a great option to me!
in that design the 2 syscalls are open and mmap, and then afterwards we can munmap if we're a language server and want to selectively clean up
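A sketch of that load path (exact std.posix flag and alignment spellings vary between Zig versions):

const std = @import("std");

fn mmapCache(path: []const u8) ![]align(std.mem.page_size) u8 {
    const file = try std.fs.cwd().openFile(path, .{});
    // The mapping outlives the fd: closing the file (and, on Linux,
    // even unlinking it) keeps the pages valid until munmap.
    defer file.close();
    const len: usize = @intCast((try file.stat()).size);
    return std.posix.mmap(null, len, std.posix.PROT.READ, .{ .TYPE = .PRIVATE }, file.handle, 0);
}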
this sounds awesome :smiley:
Yep
Brendan Hansknecht said:
With arenas we likely just have to be minorly careful about lifetimes.
ArrayList didn't grow this debug feature yet but some of the data structures such as hash maps have "pointer locks", so if you want to keep element pointers alive for a nontrivial amount of function call graph, you can do something like this:
map.lockPointers();
defer map.unlockPointers();
and it will crash if you call any of the non "assume capacity" methods, i.e. you get a crash instead of a potential use-after-free
in release fast mode those are no-ops.
Richard Feldman said:
I'm not sure how that works outside of an arena :sweat_smile:
Thanks
Yeah, that matches what I expected
Also, cool debug feature
Brendan Hansknecht said:
As a note, the string interner probably needs to be two gpa-backed arraylists: one of starts and lengths for the interned strings, and one flat u8 arraylist of string content. That will make serialization easy.
Aside: does string interning do deduplication? If so, it probably also needs a hashmap from string to the index it was inserted at. No need to serialize the hashmap, because the string interner will be finalized when serialized to disk.
This is the comment above the StringLiteral.Interner:
/// As opposed to the [IdentName.Interner], this interner does not
/// deduplicate its values because they are expected to be almost all unique,
/// and also larger, meaning more expensive to do equality checking on.
I also thought of something similar, since the strings won't be changing in the buffer. How do you feel about a design like this?
// just after initialization
expect interner.buff == ""
expect interner.start_positions == [0]

// inserting 2 strings
const hi_idx = interner.insert("hi"); // 0
const there_idx = interner.insert("there"); // 1

// results in
expect interner.buff == "hithere"
expect interner.start_positions == [0, 2, 7]

// no need to store start and length, since the next value in start_positions holds where the string ends
expect interner.get(there_idx) == interner.buff[interner.start_positions[there_idx]..interner.start_positions[there_idx + 1]]
Would be happy to implement it. What type should the start_positions arraylist hold? If only string literals from source code will be in the interner, u32 would be enough. Source files with > 4 GB of string literals aren't realistic imo. However, if we stored embedded files in these as well (think import "some-file" as some_str : Str), then I'd use a usize just in case. I'm easily convinced it's overkill though.
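A runnable sketch of that design (names made up; the trailing offset is seeded lazily on first insert here):

const std = @import("std");

/// Entry i lives at buf[starts[i]..starts[i + 1]], so only start
/// offsets are stored, no lengths.
const StringBuf = struct {
    buf: std.ArrayListUnmanaged(u8) = .{},
    starts: std.ArrayListUnmanaged(u32) = .{},

    fn insert(self: *StringBuf, gpa: std.mem.Allocator, string: []const u8) !u32 {
        if (self.starts.items.len == 0) try self.starts.append(gpa, 0);
        const idx: u32 = @intCast(self.starts.items.len - 1);
        try self.buf.appendSlice(gpa, string);
        try self.starts.append(gpa, @intCast(self.buf.items.len));
        return idx;
    }

    fn get(self: *const StringBuf, idx: u32) []const u8 {
        return self.buf.items[self.starts.items[idx]..self.starts.items[idx + 1]];
    }
};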
Null termination is viable as well. With short strings the CPU cache savings can be worth it
Or a previous strategy I’ve seen, of prepending the string with a varint length, encoded backwards, and placed before the offset you return.
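One way that decode could work, as a sketch: the length's low 7-bit group sits immediately before the string, and a set high bit means another group lies one byte further back:

fn lengthBefore(buf: []const u8, str_offset: usize) usize {
    var len: usize = 0;
    var shift: u6 = 0;
    var i = str_offset;
    while (true) {
        i -= 1;
        const byte = buf[i];
        len |= @as(usize, byte & 0x7f) << shift;
        if (byte & 0x80 == 0) break;
        shift += 7;
    }
    return len;
}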
Cool! Could do either. @Joshua Warner do you think storing the length that way would be beneficial over having null terminated strings?
The advantage would be in making computing the length of strings >= ~64 bytes (a cache line) faster
(that said, I think it's unlikely we'll have many identifiers >= 64 bytes, so :shrug: )
This is the string literal Interner. We will have some large ones
Norbert Hajagos said:
Would be happy to implement it. What type should the start_positions arraylist hold? If only string literals from source code will be in the interner, u32 would be enough. Source files with > 4 GB of string literals aren't realistic imo. However, if we stored embedded files in these as well (think import "some-file" as some_str : Str), then I'd use a usize just in case. I'm easily convinced it's overkill though.
I think we should handle imported sources specially and just never put them in interners. That would be a waste. And since the interners are per-module, the strings from a single module can never be over 4GB.
should literals be interned at all? I don't see the benefit honestly :sweat_smile:
Only benefit is that you can close files....so maybe not worth it.
Might be better to just reference back to the original file
I guess also maybe minor cache gains later in the compiler
But yeah, I think the only real gain is if it saves a lot of memory in large roc projects due to being able to free files.
Presumably we would want to serialize that data with the can IR (or whatever else we're caching)
That way we only have to read the can IR and not the original file (e.g. supposing we're confident it hasn't changed - and assuming we didn't need to hash it to figure that out)
we already have to read the original file into memory in order to hash it, because the hash is the key for the cached can IR
and the can IR's cache key is a hash of the source bytes, so on a cache hit we know all the offsets into the source file will still be valid! :smiley:
IMO we ought to skip hashing most of the time
That sort of thing can really matter for large projects
(this is exactly what git does - it doesn't rehash all files in your git dir every time you commit - only the ones that have a different stat)
If we can only read one file (the can ir), and not seek the disk to read the original file again, that's a substantial speed boost
Note: this is the kind of thing that is super easy to change later
So either way we go should be fine.
But yeah, especially on a HDD, reading only one could be big gains
I'm not sure how we could skip it :thinking:
like if the CLI receives an instruction to check main.roc, how would it know what cached IR to load without reading and hashing the source bytes?
(it's different if it's a persistent process that's watching the filesystem for changes, of course)
Oh yeah, the hash of the file is the key to the cache.
So hashing is required unless we come up with a different key
Joshua Warner said:
Or a previous strategy I’ve seen, of prepending the string with a varint length, encoded backwards, and placed before the offset you return.
This never occurred to me. Cool idea!
Richard Feldman said:
should literals be interned at all? I don't see the benefit honestly :sweat_smile:
In zig we do it because it helps simplify our memory model, particularly with serialization
I must be missing something because I don't see the connection with files
if the literal doesn't have any escapes in it, you can just store the region in the original source file
because those are the exact bytes you want anyway
Richard Feldman said:
like if the CLI receives an instruction to check main.roc, how would it know what cached IR to load without reading and hashing the source bytes?
Have a cache file per source file path, and store the size, mtime, inode. If any of those change, cache miss. We do this in zig for source -> ZIR
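A sketch of that check in Zig (field names per std.fs.File.Stat; the cached triple would live alongside the cache entry):

const std = @import("std");

const CachedStat = struct { size: u64, mtime: i128, inode: std.fs.File.INode };

/// If none of (size, mtime, inode) changed, treat the cached IR as
/// fresh and skip reading and hashing the source entirely.
fn isFresh(dir: std.fs.Dir, sub_path: []const u8, cached: CachedStat) !bool {
    const st = try dir.statFile(sub_path);
    return st.size == cached.size and st.mtime == cached.mtime and st.inode == cached.inode;
}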
Oh, we don't have source files, tokens, or ast nodes loaded at all in the happy path
we could do the size/mtime/inode/etc as an extra layer, but if any of those change we still have to redo the hash computation
so the tradeoff for the cache hit scenario would be:
in the cache miss scenario, we might discover that we need to re-hash sooner...but then the first thing we do in a cache miss is hash anyway, to know what to name the new cache file :big_smile:
so I guess another way to say it is that we re-hash whether there's going to be a cache miss or cache hit, so might as well just always do it, yeah?
I think you should be able to avoid rehashing when you have a hit
In our case it would be extremely wasteful since we don't even need access to the file contents on a cache hit
Maybe I'm not understanding what you mean when you say "hash". Are you talking about the file contents?
I vaguely recall something about assigning a unique integer per identifier across all source files or something like that. I'm guessing that's what you mean by "hash"
by hash I specifically mean a BLAKE3 hash of the entire contents of the .roc source file
:thinking: how do we know we have a hit without rehashing though?
(mtime, inode, size) is generally considered to be reliable enough. As a bonus you can try to detect mtime granularity and avoid false positives in case the file system has coarse granularity (i.e. NFS, which nobody should be using)
fstat cache hits are going to be a lot faster than hashing cache hits. Especially if you don't need to load the file contents on hit
If we intern string literals, then I'm pretty sure we wouldn't need anything from the source file, just the cache contents
Unless there are compiler errors and we need to show diagnostics based on regions in the source, of course
gotcha, yeah that's a reasonable argument for layering a (mtime, inode, size) cache on top of the BLAKE3 hash cache, to avoid reading the files and BLAKE3-ing them unnecessarily in the happy path case :thumbs_up:
I don't think we need to do any special design work for that now - can add that on sometime after 0.1.0
but it is a good argument for not assuming we have the source bytes in memory
so yeah, maybe just interning them is simplest haha
(them being literals)
It's simpler for 0.1.0 as well
But not that much simpler
interning literals?
Yes
We can differentiate between interning (ensuring we have one unique copy of each) and just copying them
The latter is what we need to avoid re-reading the source file
The former is potentially a small binary size optimization, but this is probably the wrong layer to do that at
We need to ensure symbols are unique, but I don't see any particular reason we need to guarantee strings are
I'm talking about just copying, not interning, good distinction
Yeah, string interner doesn't deduplicate
It could, but unnecessary for the most part
I've made a draft PR with an initial implementation of Parser AST -> SExpr
https://github.com/roc-lang/roc/pull/7629
It doesn't really do a whole lot right now... but it compiles and runs :smiley:
test "example s-expr" {
const source =
\\module []
\\
\\foo = "bar"
;
const expected =
\\(header statement)
;
try testHelper(testing.allocator, source, expected);
}
Intention will be to make it generate snapshots somehow... but while I'm developing the IR and wiring it up I'll probably use a unit test.
After going down this road a bit further... I'm thinking this may be the wrong approach.
The issue is how to work with the nested nodes and our SoA structure.
I'm going to try just using the standard zig format
That does mean we are providing a std.io.Writer, but I assume there is a way for that to not allocate, like maybe a heap allocated writer or similar.
@Joshua Warner @Anthony Bullard
How should I be getting the ident strings back out of the Parser AST?
I'm reasonably sure we just haven't got to this part yet, but wondering where the interned strings would be.
/// Represents a Pattern used in pattern matching.
pub const Pattern = union(enum) {
    ident: struct {
        ident_tok: TokenIdx, // <--- should this be an interned Ident.Idx instead?
        region: Region,
    },
    // ...
};

pub const Token = struct {
    tag: Tag,
    offset: u32,
    extra: Extra,

    pub const Extra = union {
        length: u32,
        interned: base.Ident.Idx, // <--- or should I be pulling it out from here??
    };
};
Or maybe using the offset and length and just indexing from the source bytes. Or maybe I should be using the Region somehow.
Ok, yeah I'm pretty sure I should be grabbing them from the ModuleEnv using the Ident.Idx. So I'll need to take the env as an argument for the formatting helper.
Yeah, you can fetch idents from the ModuleEnv
Strings and numbers and such you’ll have to fetch from the source for now, but my plan is to also stash those somewhere, possibly on the ModuleEnv
Ok got something working. Just building out more of the structure for a minimal SExpr test
@Joshua Warner I've currently got the Can IR assuming that we have StringLiteral.Idx for the text of numbers as well as strings: https://github.com/roc-lang/roc/blob/7fc2a08e2811fed7207ab5035f680bbf697d232f/src/check/canonicalize/IR.zig#L88
Luke Boswell said:
After going down this road a bit further... I'm thinking this may be the wrong approach.
Would be glad to chat about this design space at some point or help with implementation issues. Overall, the perf isn't too important. Though good perf will help with fuzzing
It was more a reflection on my SExpr "library" and how I had planned to map between the IRs and that to generate a string.
My new approach is to make a toStr helper for things that can be implemented at each level in the IR (but still has the global SoA context). Probably a terrible description -- but I'll get something minimal working soon and share my PR for comments/review.
Basically I'm experimenting with ways to do it as I haven't built something like this before.
@Brendan Hansknecht interested to know what you think of this https://github.com/roc-lang/roc/pull/7629
Currently have this test passing.
test "example s-expr" {
const source =
\\module [foo, bar]
\\
\\foo = "hey"
\\bar = "yo"
;
const expected =
\\(parse_ast (header "module" (exposes ("foo" "bar"))) (statements (decl (pattern (ident "foo")) (body ((expr "hey")))) (decl (pattern (ident "bar")) (body ((expr "yo"))))))
;
try testSExprHelper(testing.allocator, source, expected);
}
Basic idea is to implement the following for various levels of the IR hierarchy and just call it recursively writing out the AST as it goes.
pub fn toStr(ir: *@This(), env: *base.ModuleEnv, writer: std.io.AnyWriter) !void { ... }
Ah, but then you have to separately make a parser for every ir? Well, I guess the parser isn't strictly required. Just nice to have
@Brendan Hansknecht I was heading in the direction of having a separate SExpr IR -- see https://github.com/roc-lang/roc/blob/c72993c7519188698937bcc359d15e5b032ebe78/src/check/parse/SExpr.zig for an example.
This approach with toStr is my second attempt but taking a different direction.
I'm not sure how important it will be to parse these IRs. That seems to be a tangential requirement in the nice-to-have category -- but it also feels like a rabbit hole in itself.
I can appreciate the idea of building a list of tokens instead of writing directly to a buffer -- I was thinking this enables us to implement checks to ensure everything is well formed.
I am not opposed to returning to try the SExpr IR idea again; I think I would simplify it a little further, only having the "node" itself be a comptime type variable.
Why does anything need comptime? (I think the comptime free IR I shared on the PR should be enough)
I'm not sure how important it will be to parse these IRs. That seems to be a tangential requirement in the nice-to-have category -- but it also feels like a rabbit hole in itself.
I don't think we need to write any of the parsing right now. I think we can just write a basic node type, conversion logic from parse ir to that basic node type, and a printer.
I think it could be super minimal
Cause the sexpr ir should be trivial to print
(and pretty trivial to parse too)
Yeah it's easy to do. I did the first version like that (without comptime).
All this said, I don't want to block anything, but I don't understand the issue with a super simple sexpr ir as an intermediate.
I added the comptime so you could pass in a "parser" function to take the []u8 and give you back the nodes you expect -- or an error if it didn't recognise the node string
There's no issue. I'm just exploring the design space a little.
Oh, I would just separately parse to sexpr ir, then simply pattern match on sexpr ir to convert to the parse ir.
so parsing sexpr IR would accept all node types and names. Only the final conversion would invalidate specific things like node types
That's probably simpler than where I was going. :smiley:
If we use sexprs to build IRs for later stages (anything after solve), will we need to parse nodes that contain type information as well?
Likely
Though we can make type info a child node or side band if necessary (as long as we make a way to print it)
Or it may be the case that we don't support parsing later IRs and they simply print in a lossy form. Haven't fully thought through this
Brendan Hansknecht said:
Or it may be the case that we don't support parsing later IRs and they simply print in a lossy form. Haven't fully thought through this
I've been thinking along these lines as one of the reasons why we may not want to "parse" the IR's.
The other thought is that these SExpr IR's aren't very nice to read. I'm starting to think it might be nicer just to make something more human readable that is unique for each IR that just includes the context relevant for that stage.
 Luke Boswell said:
The other thought is that these SExpr IR's aren't very nice to read. I'm starting to think it might be nicer just to make something more human readable that is unique for each IR that just includes the context relevant for that stage.
I was thinking a similar thing as I've been hand-writing some IRs (while exploring the function solving stage). It seems like sexprs that contain the ast _plus_ type info could get unwieldy.
And each stage has different data, so if there's one main parser, it would need to support a bunch of nodes that are maybe only applicable to a single stage
My thinking was to do sexprs with perhaps some tasteful extensions. Keep it simple to print and parse but perhaps slightly less verbose
I guess I see a two-way split. I think it is fine to go either way.
Just depends on our exact goals.
Goal one is definitely snapshot tests (which requires printing and some reliability, but doesn't need to be perfect).
Goal two is parsing for better fuzzing (probably also requires verification, which may actually not be reasonable).
I think we could minorly expand expressions to support types and that would be reasonable (something to worry about later)
So I would argue that s expressions leave more open for future improvements
That said, the simplicity of bespoke printing and no parsing is also reasonable as long as we try to be consistent with printing.
I definitely still lean s expr currently.
I've got something simple that is working ok. I'm going to continue with converting it into some small snapshots and trying a simple heuristic for pretty printing.
I'm busy all day tomorrow, but should be able to make some progress later in the week.
I've updated my PR and would like to merge what I have to prevent it going stale again.
I haven't quite figured out how to get the string parts back to working yet, but the rest is working well.
I also added an option to give either compact or pretty formatted strings. So now our unit test looks like this.
The next step is to generate actual snapshot files instead of having a unit test. But I'm pretty happy now with the API I think.
@Brendan Hansknecht can you cast your eye over my implementation -- particularly the children: std.ArrayListUnmanaged(Node) part... :pray:
test "example s-expr" {
const source =
\\module [foo, bar]
\\
\\foo = "hey"
\\bar = "yo"
;
const expected =
\\(file
\\ (header
\\ 'foo'
\\ 'bar')
\\ (decl
\\ (ident
\\ 'foo')
\\ (string_part))
\\ (decl
\\ (ident
\\ 'bar')
\\ (string_part)))
;
try testSExprHelper(source, expected);
}
Ok, I've had a crack at an initial implementation for a tool to run through all the snapshot files... https://github.com/roc-lang/roc/pull/7644
This PR adds a snapshot tool that iterates through all the snapshot text files in src/snapshots/*.txt
and re-generates the AST/IR's for each compiler stage.
$ zig build snapshot
info: processed 2 snapshots in 1ms.
$ zig build snapshot -- --verbose
info: src/snapshots/002.txt
info: src/snapshots/001.txt
info: processed 2 snapshots in 2ms.
Here is an example of a snapshot file.
~~~META
description=Basic example to develop the snapshot methodology
~~~SOURCE
module [foo, bar]
foo = "one"
bar = "two"
~~~PARSE
(file
  (header
    'foo'
    'bar')
  (decl
    (ident
      'foo')
    (string_part))
  (decl
    (ident
      'bar')
    (string_part)))
@Isaac Van Doren -- how would you feel about me adding FORMAT as a section in these snapshot tests, instead of having a unit test that writes out to another file?
We could then also run the formatter on each snapshot file in our corpus.
If we think that's a good idea, I'm down to give it a try. We can land the format cli PR, I can rebase my snapshot PR and try it out
I guess I'm thinking of the "snapshot" tool as like our integration test suite...
Yeah, sounds great!
Would also be great to make the format section optional. If the section is missing, it will then be required that the source section formats to itself.
Well... it's kind of one way.
So like, we put un-formatted code in the SOURCE section... it formats it and adds to the FORMATED section.
Then you can see in any changes in the git diff
Did you have another idea in mind @Brendan Hansknecht ? why would we want it optional?
Or maybe if the formatted code is identical we don't bother writing out the section... but I wonder if it's cheaper just to dump the bytes out than compare the strings.
I guess, to work with git diff: if there's only a SOURCE section, the process is read source, format it, and write it back to SOURCE (only having a SOURCE section means it is already formatted correctly). This avoids printing the same chunk of code twice in a row if the formatting is already good.
If a FORMATTED section exists, we expect the SOURCE to change when formatted. So we need to write out both sections and they aren't redundant.
So it's just noise reduction, but it ensures that every test case can be formatted
Ah, that's a good idea. That way we will know if the formatting changes, but not duplicate things unnecessarily.
Maybe that isn't needed, but I feel like for long examples, it will be nice to not print the source twice
FYI @Isaac Van Doren I merged your PR -- I'll adapt your integration tests into the snapshot tool as above.
@Brendan Hansknecht -- I assumed we want to build the snapshot tool by default so you can run the integration tests easily
I'm happy to make it a flag though. Just that was my assumption
I guess it should probably just sit behind zig build test also? Or might we want to keep that for unit tests only?
Yeah validating formatting in the multi tiered snapshots sounds useful
But it won't be testing the same behavior as the test I wrote, so I think we should keep it. That test wasn't meant to really check anything about the actual formatting itself, just that roc format correctly reads in the specified file and writes back out the formatted version.
Originally that test was e2e (it actually built the roc executable and invoked it). I think it would be good to stand up a separate e2e suite for the CLI which would be a good place for tests like that to live
Luke Boswell said:
Brendan Hansknecht -- I assumed we want to build the snapshot tool by default so you can run the integration tests easily
That's probably fine.
Just thinking if someone only wants to build roc, they can just zig build and that can be roc and no other wasted time
Any concerns with a separate e2e suite for the CLI? I'm happy to set it up
I feel like it would be pretty redundant with all of the snapshot tests. Not bad to add a few of, but it is probably preferable to avoid too many scattered tests
Luke Boswell said:
I guess it should probably just sit behind zig build test also? Or might we want to keep that for unit tests only?
If we put it behind zig build test, we just need to make sure it is fast enough. Might hurt the dev loop, in which case it probably should be zig build test -Dsnapshot. I think really the separation is for ensuring fast feedback.
I assume eventually we will have enough snapshot tests we will want to separate it out, but probably fine to start unified and only separate out if it gets too slow.
I might leave it as zig build snapshot for now... at least for this PR
I can imagine we'll iterate on the design a few more times at least
I feel like it would be pretty redundant with all of the snapshot tests. Not bad to add a few of, but it is probably preferable to avoid too many scattered tests
My intention would be to focus on the CLI/orchestration logic that wouldn't be tested by the snapshots at all, so I don't think there would be much overlap there. I wouldn't try to make it exhaustive, just enough for some confidence that changes to the CLI are correct without having to manually test everything.
Ah, I thought you meant specifically formatting tests. Like another suite of tests but just for formatting.
But like tests that format examples/stdlib (folder vs file vs invalid) or exercise various cli args sound great
Right, that's what I'm thinking :smiley:
Anyone looking for something to help out with, we should set up our snapshot tool to run in CI with the other zig things.
I haven't started looking at this yet. But it's ready to be added now.
I think the next thing I'd like to tackle is parsing just an expression (not a full file), and translate some of our syntax snapshot tests across to work on improving coverage there.
Maybe look at how we might handle multi file snapshots... :thinking:
we should set up our snapshot tool to run in CI
What are the commands to use it?
zig build snapshot
and also zig build snapshot -- --verbose
I'll get on that after the roc nix package
zig build snapshot updates the golden values, right? Why do we want to run it in CI?
You run it, then a few extra git commands to ensure that none of the snapshot files changed
If any change or if new ones pop into existence, it is a failure
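e.g. something like this in the CI job (a sketch; git diff --exit-code fails the step if any tracked file changed):

zig build snapshot
git diff --exit-code src/snapshots

Newly generated snapshots would be untracked, so a check like test -z "$(git status --porcelain src/snapshots)" covers the "new ones pop into existence" case.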
Ah I see. I was imagining that there would be a command that would explicitly check that the snapshots matched the current output without requiring you to manually diff them
Small improvement to sexpr pretty printing: https://github.com/roc-lang/roc/pull/7659
What approach should we take for snapshots that are just a single expression...
For example, in crates/compiler/test_syntax/tests/snapshots/pass/add_var_with_spaces.expr.roc we have the single expression x + 2
Would we want snapshots this small, i.e. without being a full module?
I was thinking of migrating these across one by one as snapshots and using them to help add parser functionality.
Or perhaps we would want something like this instead... so turn them into a full module.
~~~META
description=Add a variable with spaces
~~~SOURCE
module [add2]
add2 = x + 2
~~~PARSE
(file
  (header 'add2')
  (decl
    (ident 'add2')
    (binop
      '+'
      (ident '' 'x')
      (int '2'))))
Also -- minor tangent while I'm looking at it... would we prefer (binop '+' <lhs> <rhs>)... or switch on the token and make it more like (plus <lhs> <rhs>)?
I'd vote for the former (AKA (binop ...)) since it's more consistent with how the rest of the S-expr stuff seems to work, and it gets desugared immediately in the next stage, so it's minor anyway
Definitely want to be able to parse single syntax nodes (exprs, statements, header, etc)
Could just have type = Expr in the META section of the snapshot
It's really cool seeing this stuff come alive...
~~~META
description=Various type annotations
~~~SOURCE
module []
foo : U64
bar : Thing(a, b, _)
baz : (a, b, c)
add_one : (U8, U16 -> U32)
main! : List(String) -> Result({}, _)
~~~PARSE
(file
  (header)
  (type_anno
    'foo'
    (tag 'U64'))
  (type_anno
    'bar'
    (tag
      'Thing'
      (ty_var 'a')
      (ty_var 'b')
      (_)))
  (type_anno
    'baz'
    (tuple
      (ty_var 'a')
      (ty_var 'b')
      (ty_var 'c')))
  (type_anno
    'add_one'
    (fn
      (tag 'U32')
      (tag 'U8')
      (tag 'U16')))
  (type_anno
    'main!'
    (fn
      (tag
        'Result'
        (record)
        (_))
      (tag
        'List'
        (tag 'String')))))
Some simplifications I'm debating rn (just for the snapshots, not renaming tag unions in the zig source or anything)
type_anno -> tann
ty_var -> tvar
There's a lot of design space here. It's nice that this is all one way and so easy to update the snapshots.
type_anno and ty_var are a lot clearer; I wouldn't change them unless we experience noticeable problems due to verbosity.
I'd just like to say this in case anyone is wondering about the current state of snapshots, because they look somewhat strange.
I've just been dumping things in there to help figure out syntax issues with the parser and seed the fuzzer etc.
The current snapshots are pretty expendable and we will probably blow them away soon or change them a lot when we start having a usable Can etc.
I think this output is really nice, but I _personally_ would prefer double quotes for strings. Are we using singles just to avoid having to worry about escaping?
Just wanted to show off a little... checkout this snapshot. I'm pretty happy with how this is starting to look. It's really easy to see at a glance where the issues are. You can feed inputs from fuzz crashes and dissect what the compiler is doing pretty quickly -- which has helped me find and fix a number of issues.
I'm loving the formatting for Problems also :smiley:
~~~META
description=fuzz crash
~~~SOURCE
H{o,
]
foo =
"on (string 'onmo %')))
~~~PROBLEMS
TOKENIZE: (2:4-2:4) AsciiControl:
]
^
TOKENIZE: (1:6-1:6) Expected the correct closing brace here:
]
^
TOKENIZE: (6:7-6:36) UnclosedString:
"on (string 'onmo %')))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PARSER: missing_header
~~~TOKENS
UpperIdent,OpenCurly,LowerIdent,Comma,Newline,CloseCurly,Newline,LowerIdent,OpAssign,Newline,StringStart,StringPart,EndOfFile
~~~PARSE
(file
  (malformed_header 'missing_header')
  (record (field 'o'))
  (decl
    (ident 'foo')
    (string 'on (string 'onmo %')))')))
~~~FORMATTED
{ o }
foo = "on (string 'onmo %')))"
~~~END
Anthony Bullard said:
I think this output is really nice, but I _personally_ would prefer double quotes for strings. Are we using singles just to avoid having to worry about escaping?
I changed it to double quotes in https://github.com/roc-lang/roc/pull/7672
Hey @Luke Boswell I'm dealing with an issue with Zig where TLDR we want to move to using TAB as the canonical indentation character in the formatter but Zig multiline string literals do not support using TAB in them. So long story short I need to move all of the current fmt tests to snapshots (I don't have to per se, but I think it's the logical move).
In order to do that I am moving to make snapshots support different IR nodes like the old test_syntax snapshots did. Are you aligned with this? It means that I need to expose some extra methods on the Parse IR so that we can output the SExpr and format correctly.
I'm not sure what you mean by
make snapshots support different IR nodes like the old test_syntax snapshots did
In general though, I think moving the fmt tests to snapshots is a great idea!
That was my intention too, but I figured you wanted the kitchen sink in there because it was easier to work with while developing the Parser initial implementation.
Luke Boswell said:
I'm not sure what you mean by
make snapshots support different IR nodes like the old test_syntax snapshots did
sorry, I mean have it support snapshots of just exprs, statements, or headers instead of just modules
yeah, that is something we need to support anyway (imo)