Stream: compiler development

Topic: Canonicalization overhaul - overview

Sam Mohr (Jan 13 2025 at 22:32):

For those of you that don't know, I am working on reworking the canonicalization code in roc_can for a few reasons:

- We have lots of language changes that need to be implemented, and it'll be faster to just implement those all in one go instead of incrementally. I'm talking:
  - static dispatch
  - removing tracking of lambda sets/captures (this will be done later)
  - setting up for the rest of the new compiler pipeline
  - and more...
- We want to support caching by making canonicalization deterministic
- It's some of the oldest code in the compiler. It generally works pretty well, but it could use some love

In order to achieve these goals, the plan is to break roc_can into two new crates:

- roc_can_solo: Canonicalize a module pretending that no other modules exist. Imports are treated as things that might or might not exist, stuff like that. This will be doing the majority of the work in canonicalization. When this is finished, its output will be 100% determined by the source code, independent of where it is in the filesystem. That way, we can cache this output keyed by a hash of the file's contents, very simple and performant.
- roc_can_combine: Take the outputs of the roc_can_solo crate and stitch them together, including:
  - resolving imports
  - resolving ingested files (e.g. import "path/to/file.json" as data : List U8)
  - resolving aliases
  - etc.
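
To make the caching idea concrete, here is a minimal sketch of a cache keyed by a hash of the file's contents. Everything in it is an assumption for illustration: `SoloOutput`, `canonicalize_solo`, `content_key`, and `SoloCache` are hypothetical names, not the real roc_can_solo API, and the standard library's `DefaultHasher` stands in for whatever stable hash an on-disk cache would actually need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the output of solo canonicalization.
#[derive(Clone, Debug, PartialEq)]
pub struct SoloOutput {
    /// Imports are only recorded here; resolving them is left to the combine step.
    pub unresolved_imports: Vec<String>,
}

/// Cache key derived only from the source text, never from the file path.
/// DefaultHasher is fine for an in-process sketch, but it is not guaranteed
/// to be stable across Rust releases, so a real on-disk cache would need
/// a different hash.
pub fn content_key(source: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish()
}

/// Toy "canonicalization": a deterministic function of the source text alone.
pub fn canonicalize_solo(source: &str) -> SoloOutput {
    let unresolved_imports = source
        .lines()
        .filter_map(|line| line.trim().strip_prefix("import "))
        .map(|rest| rest.trim().to_string())
        .collect();
    SoloOutput { unresolved_imports }
}

/// In-memory cache from content hash to solo output.
pub struct SoloCache {
    entries: HashMap<u64, SoloOutput>,
    /// Number of times the source actually had to be canonicalized.
    pub misses: usize,
}

impl SoloCache {
    pub fn new() -> Self {
        SoloCache { entries: HashMap::new(), misses: 0 }
    }

    /// Return the cached output for identical source text, or compute and store it.
    pub fn get_or_canonicalize(&mut self, source: &str) -> SoloOutput {
        let key = content_key(source);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.misses += 1;
        let output = canonicalize_solo(source);
        self.entries.insert(key, output.clone());
        output
    }
}
```

Because the key depends only on the text, two copies of the same module at different paths would share one cache entry, which is the property described above.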

Once roc_can_combine is finished, we can pass the results to the type checker as before. In the future, we can even do partial typechecking on the values in each solo module, which should not even require that many changes to roc_constrain, but I don't know how feasible that is.

Some nice things that will come out of this change:

This new pair of cans should be much more robust, as I plan to never crash. As was suggested by Richard and used by Agus in the new monomorphization code, we can make a new type of problem be a CompilerProblem, and all places where we "need to crash" can just use one of those as a runtime error-generating AST node. This is more threading in the compiler than before, but it's way safer.
We are now putting everything into bump allocators, and will eventually ensure we don't do global allocations by making these crates #![no_std], which makes that impossible. By having all canonicalization info under the same bump allocator, we get caching for free, not to mention a good performance improvement!
I'm trying to put all work post-parsing into these crates, meaning that much less of the work is in load_internal. This should make canonicalization more modular than before.
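
As a rough illustration of the "never crash" idea, the sketch below records a problem and emits a runtime-error node wherever the code would otherwise panic. The types are assumptions made up for this example: `ParsedExpr`, `CanExpr`, and this `CompilerProblem` enum are hypothetical and much simpler than anything in the real compiler.

```rust
/// Hypothetical problem type; the real CompilerProblem may look quite different.
#[derive(Debug, Clone, PartialEq)]
pub enum CompilerProblem {
    UnsupportedSyntax(String),
}

/// Hypothetical, heavily simplified parser output.
#[derive(Debug, Clone, PartialEq)]
pub enum ParsedExpr {
    Num(i64),
    Add(Box<ParsedExpr>, Box<ParsedExpr>),
    /// Something canonicalization does not know how to handle.
    Unsupported(String),
}

/// Hypothetical canonical IR with a node that fails at runtime instead of at compile time.
#[derive(Debug, Clone, PartialEq)]
pub enum CanExpr {
    Num(i64),
    Add(Box<CanExpr>, Box<CanExpr>),
    RuntimeError(CompilerProblem),
}

/// Canonicalize without panicking: every "need to crash" spot reports a
/// problem and returns a RuntimeError node, so the rest of the tree survives.
pub fn canonicalize(parsed: &ParsedExpr, problems: &mut Vec<CompilerProblem>) -> CanExpr {
    match parsed {
        ParsedExpr::Num(n) => CanExpr::Num(*n),
        ParsedExpr::Add(left, right) => CanExpr::Add(
            Box::new(canonicalize(left, problems)),
            Box::new(canonicalize(right, problems)),
        ),
        ParsedExpr::Unsupported(what) => {
            let problem = CompilerProblem::UnsupportedSyntax(what.clone());
            problems.push(problem.clone());
            CanExpr::RuntimeError(problem)
        }
    }
}
```

The cost is the extra threading mentioned above: the `problems` list has to be passed through every recursive call.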

The plan for implementing this change comes down to me understanding the final outcome's shape well enough to implement it in small, peer-reviewable chunks. To that end, I'm working in this branch of my fork with a machete to roughly shape a copy of roc_can into the right shape. At the same time, I have a markdown document on my machine where I'm writing down what the "plain English" recipe for a solo module canonicalization, and then a combined one, will look like. Once I have those together, I'll start making PRs!

Feel free to ask any questions.

Joshua Warner (Jan 13 2025 at 22:35):

Awesome!!!

Joshua Warner (Jan 13 2025 at 22:35):

Think I'll take a pause on trying to resolve can panics for the moment then

Sam Mohr (Jan 13 2025 at 22:40):

This is really my main focus when working on Roc at the moment, but I am just one person. If anyone is concerned by the likely slowdown to implementing static dispatch in a big bundle with all of these other changes, I understand and partly share your concern. Since this doesn't immediately slot into the rest of the compiler plan, it might be a couple months for us to have static dispatch available. So if someone really wants it now, I'd be happy to talk about how we can parallelize this work

Sam Mohr (Jan 13 2025 at 22:41):

Otherwise, I'm happy to do this myself. It's really so enriching to get to put blood, sweat, and tears into what will be the best language someday

Luke Boswell (Jan 13 2025 at 22:48):

Ah man, you're taking all the fun jobs :smiling_face:

Luke Boswell (Jan 13 2025 at 22:49):

I volunteer to help find all the bugs you leave behind

Sam Mohr (Jan 13 2025 at 22:51):

Oh yeah, I forgot lmao. When I was writing the snake_case conversion, I wanted to break a test intentionally and write panic!("Luke, you're my only hope") or something.

Sam Mohr (Jan 13 2025 at 22:51):

I'll make sure to do that for this next set of PRs just for you

Anton (Jan 14 2025 at 10:15):

We have lots of language changes that need to be implemented, and it'll be faster to just implement those all in one go instead of incrementally. I'm talking:

static dispatch

removing tracking of lambda sets/captures (this will be done later)

setting up for the rest of the new compiler pipeline

and more...

These are not like the syntax changes, it seems like difficult bugs are a real possibility here. For debugging it can help a lot if you only need to consider a small set of changes. Are you sure this will be faster if you include potential debug time for issues that may pop up from the entire Roc ecosystem?

Sam Mohr (Jan 14 2025 at 10:19):

The plan is not to make one giant PR with all changes included, but to make the changes incrementally on a new canonicalization code

Sam Mohr (Jan 14 2025 at 10:19):

I just have no idea what it'll look like yet. What change do I make first?

Sam Mohr (Jan 14 2025 at 10:21):

I don't know how to make these changes in steps. One benefit to this approach is that I get to ignore a lot of features to begin with. I'm starting without tracking lambda sets, without module params, etc.

Anton (Jan 14 2025 at 10:21):

changes incrementally on a new canonicalization code

Can you explain this in more detail?

Sam Mohr (Jan 14 2025 at 10:22):

If someone would know how to do this without such a nuclear option that wouldn't take 6 months, it'd be great to hear

Sam Mohr (Jan 14 2025 at 10:22):

I'm planning on modelling my changes on the strategy Agus has been taking with the new monomorphization code, more or less

Sam Mohr (Jan 14 2025 at 10:23):

Start by outlining the end shape, and leaving a whole lot of "implement this and TODO" in places where it's obvious what needs to happen

Sam Mohr (Jan 14 2025 at 10:24):

That can help us start with PRs that other people can understand

Sam Mohr (Jan 14 2025 at 10:24):

To help with the robustness of this, I think a very important step will also be defining a testing strategy for all of these features

Anton (Jan 14 2025 at 10:25):

Makes sense!

Sam Mohr (Jan 14 2025 at 10:26):

I've not dug too deeply into that side of things, but the canonicalization testing today is mostly testing individual warnings here or there, and a lot of desugaring testing.

Sam Mohr (Jan 14 2025 at 10:27):

We'll need to figure that out as well. My hope is that we can do more unit testing on "just canonicalize this alias", not "create a whole module with aliases and check the problems that arise"

Sam Mohr (Jan 14 2025 at 10:27):

That should make it more readable, and more modular

Sam Mohr (Jan 14 2025 at 10:28):

Probably once I figure out the overall plan, I'll try to write it up in more detail and jump on a call with someone. That will give me an opportunity to make sure that there isn't a big hole in it somewhere.

Sam Mohr (Jan 14 2025 at 10:32):

I think the main steps before I can start drafting an outline for a PR are:

- Make sure that var and shadowing works with the new scope plan
- Make sure that type checking for methods (and *.func's, what should we call these?) will work
- Figure out if we should have a separate IR for roc_can_solo and roc_can_combine
- Figure out if we can/should do typechecking after solo canonicalization, or if we can only do it with the combined and canonicalized modules
- And maybe check the plan for handling builtins in a solo context, but that's not really blocking

Luke Boswell (Jan 14 2025 at 22:49):

If someone would know how to do this without such a nuclear option that wouldn't take 6 months, it'd be great to hear

I've been thinking about this.

We will be in a position with two Can stages, the current (legacy) one, and the (new) one being developed. Both of these take the same input, Parser AST... and eventually produce the same output, Mono IR?? right?

Can we wire up a test harness that can feed the same input in, and confirm it's getting the same output?

Starting with the most basic of expressions, but over time as the new Can implementation matures we can add tests and eventually be in a position where we have feature parity.
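
A minimal sketch of such a harness, under stated assumptions: `compare_all` and `Comparison` are hypothetical names, the two implementations are passed in as plain closures, and the new one returns `None` for inputs it does not support yet so those cases are skipped rather than counted as failures.

```rust
/// Outcome of running both implementations on one input.
#[derive(Debug, PartialEq)]
pub enum Comparison {
    Match,
    Mismatch,
    /// The new implementation does not handle this input yet.
    Skipped,
}

/// Feed every input to both implementations and compare the outputs.
/// `legacy` always produces an output; `new` returns None for anything
/// it has not implemented, which lets coverage grow incrementally.
pub fn compare_all<I, O, L, N>(inputs: &[I], legacy: L, new: N) -> Vec<Comparison>
where
    O: PartialEq,
    L: Fn(&I) -> O,
    N: Fn(&I) -> Option<O>,
{
    inputs
        .iter()
        .map(|input| match new(input) {
            None => Comparison::Skipped,
            Some(out) if out == legacy(input) => Comparison::Match,
            Some(_) => Comparison::Mismatch,
        })
        .collect()
}
```

Tracking the number of `Skipped` results over time would give a rough measure of how close the new implementation is to feature parity.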

Luke Boswell (Jan 14 2025 at 22:50):

Or maybe there is a way to use the fuzzer, and incrementally add supported AST nodes

Sam Mohr (Jan 14 2025 at 22:51):

The same output won't come out because of a number of changes. Static dispatch for instance

Luke Boswell (Jan 14 2025 at 22:51):

This won't catch all the new features... but maybe it helps get us something we can use sooner

Sam Mohr (Jan 14 2025 at 22:51):

And I'm avoiding supporting abilities

Sam Mohr (Jan 14 2025 at 22:51):

But yes

Sam Mohr (Jan 14 2025 at 22:52):

For those things in common, it should output the same Mono IR

Sam Mohr (Jan 14 2025 at 22:52):

Well...

Sam Mohr (Jan 14 2025 at 22:52):

Lambda sets are getting built differently as well

Sam Mohr (Jan 14 2025 at 22:53):

In that they're supposed to be built later in the compiler

Luke Boswell (Jan 14 2025 at 22:53):

Even without Abilities and Lambda sets... we could still cover a lot of the AST though

Sam Mohr (Jan 14 2025 at 22:53):

Probably

Sam Mohr (Jan 14 2025 at 22:53):

Worth a try to make sure we're on the right track

Luke Boswell (Jan 14 2025 at 22:56):

If we had the new Can module (even just stubbed out) @Joshua Warner might be able to help with the test harness

Sam Mohr (Jan 14 2025 at 22:56):

Sure!

Sam Mohr (Jan 14 2025 at 22:56):

I think I'd be able to get something in the next few weeks as a stub

Luke Boswell (Jan 14 2025 at 23:02):

Sam Mohr said:

And I'm avoiding supporting abilities

Is there a way we could rip this out of current Can, and make it another pass to the side, or move it to the end or something? Basically... could we do something now so we can keep the current impl and then it could be compatible with the new Can?

And I guess lambda sets are in the same boat

Joshua Warner (Jan 14 2025 at 23:04):

In my professional experience, it can be very very tempting to do a rewrite _and_ make significant functionality changes at the same time, but it's almost always a terrible idea

Luke Boswell (Jan 14 2025 at 23:05):

Yeah, I'm trying to find ways we can keep everything online and enable an incremental approach.

Sam Mohr (Jan 14 2025 at 23:06):

That's why I think the first step is to try to understand the end state, and then write down a plan that outlines what things should look like at the end state, and then break that into incremental changes as much as is feasible

Luke Boswell (Jan 14 2025 at 23:07):

It's also ok to start and change course along the way. More of a discovery or R&D type approach than an up front engineering effort

Sam Mohr (Jan 14 2025 at 23:08):

Yes, I'd call this the R&D stage for sure

Sam Mohr (Jan 14 2025 at 23:08):

An idea: between roc_can_solo and roc_can_combine, the latter is basically what we do today, but separated. We can maybe start by making a very small roc_can_solo that only does a little bit of work, and then passes everything else to the old roc_can

Sam Mohr (Jan 14 2025 at 23:09):

And eventually we move as much as possible to roc_can_solo until it all works

Sam Mohr (Jan 14 2025 at 23:09):

So step two would be to figure out the caching mechanism and roughly set that up

Sam Mohr (Jan 14 2025 at 23:09):

And step one is to do the prep work of making roc_can ready for this work

Sam Mohr (Jan 14 2025 at 23:10):

Meaning moving to use arenas as much as possible, changing names of things, using CompilerProblems where possible

Joshua Warner (Jan 14 2025 at 23:10):

What if we did something like:

- Introduce a new "desugared AST" type, which initially looks a lot like the current AST
- Start caching that
- Incrementally evolve from there as necessary

Luke Boswell (Jan 14 2025 at 23:10):

Yeah, so for anything low technical maturity/R&D I would highly recommend taking a more agile/incremental approach -- keeping everything online and running ops normal.

I think the biggest risk here is the unknown-unknowns (sorry for the cliché).

Sam Mohr (Jan 14 2025 at 23:11):

Yeah, Josh's suggestion is basically what I was expecting. I can try it

Sam Mohr (Jan 14 2025 at 23:12):

The subtle difference is that I think that roc_can_solo and roc_can_combine will use the same IR

Joshua Warner (Jan 14 2025 at 23:14):

There's no reason that can't be the case eventually

Sam Mohr (Jan 14 2025 at 23:14):

Well, one option is for the new desugared IR to be a roc_can_solo::Expr that looks just like desugared IR to start with, but over time we change it bit by bit, and once roc_can_solo::Expr and roc_can_combine::Expr are the same thing, we can use roc_can_solo::Expr

Joshua Warner (Jan 14 2025 at 23:14):

I would try to get there incrementally tho

Richard Feldman (Apr 11 2025 at 17:48):

so now that the Frontend Masters stuff is wrapped up, I have a backlog of things I should be doing...but what I'm fired up to do instead is to write some Zig canonicalization code :grinning_face_with_smiling_eyes:

Richard Feldman (Apr 11 2025 at 17:48):

what's the current status of that? I have no idea how far along things are!

Richard Feldman (Apr 11 2025 at 17:48):

I'm assuming @Sam Mohr might know?

Joshua Warner (Apr 11 2025 at 18:04):

I have some local changes to implement sexprs for the can ir, planning on submitting a pr “soon”

Sam Mohr (Apr 11 2025 at 18:52):

I'd love to see Richard working on it! I don't have that much work done, but I can put it in a branch and see what comes out.

If it's not already obvious, I've been burned out on Roc development for like a month and I don't know how to fix it. I was hoping taking time to play games and not think about it would work, but nothing seems to be working... Life outside has been tough. I'll give an update soon when I have the energy to come back.

Sam Mohr (Apr 11 2025 at 18:53):

So yes, thank you Richard for picking up my slack!

Brendan Hansknecht (Apr 11 2025 at 19:20):

Sam Mohr said:

If it's not already obvious, I've been burned out on Roc development for like a month and I don't know how to fix it. I was hoping taking time to play games and not think about it would work, but nothing seems to be working... Life outside has been tough.

This is normal and something that might just take time or the right break/inspiration. My time investment in Roc varies significantly month by month. Oftentimes, it just takes a while to revive. Generally, certain things re-energize and inspire (like community events and longer vacations, e.g. holidays).

Brendan Hansknecht (Apr 11 2025 at 19:21):

Take the time you need and don't worry about roc. It will keep moving and it will still be here when you get back.

Anthony Bullard (Apr 11 2025 at 19:29):

We will miss your presence in the chat, hope to talk to you soon, buddy

Sam Mohr (Apr 11 2025 at 19:29):

Thanks Brendan

Sam Mohr (Apr 11 2025 at 19:30):

Yeah, maybe when work chills out

Richard Feldman (Apr 11 2025 at 21:42):

yeah super normal feeling... please don't feel bad about it! You're welcome whenever you're feeling it, just drop in and we'll catch you up on whatever's been happening :heart:

Richard Feldman (Apr 11 2025 at 21:42):

and thanks for all your awesome contributions so far!

Richard Feldman (Apr 11 2025 at 22:11):

also @Sam Mohr I'm happy to start from a blank slate, so no need to push a WIP branch unless you really want to :big_smile:

Anton (Apr 12 2025 at 08:11):

Life outside has been tough.

It pains me to hear that, I hope things get better :hugging:

Joshua Warner (Apr 12 2025 at 23:48):

Adding sexpr formatting to the can IR: https://github.com/roc-lang/roc/pull/7737
Note that this is largely untested & doesn't get hit (yet)

Isaac Van Doren (Apr 13 2025 at 01:14):

I’m constantly impressed by how mature and emotionally healthy the Roc community is :heart:

Sam Mohr (Apr 13 2025 at 21:47):

Yeah, it's really nice to see

Sam Mohr (Apr 13 2025 at 21:48):

There's a selection bias in this group for people that are willing to give their free time for no pay to improve the state of programming

Sam Mohr (Apr 13 2025 at 21:48):

So how surprised can we be?

Anthony Bullard (May 17 2025 at 18:51):

I wonder if anyone has made any progress here?

Anthony Bullard (May 17 2025 at 18:52):

I'd really love to do enough to get a Hello World program to be able to run (in an interpreter)

Brendan Hansknecht (May 17 2025 at 18:54):

I know this PR is moving, but not sure how much else has moved: https://github.com/roc-lang/roc/pull/7772

Anthony Bullard (May 17 2025 at 18:57):

I'd love to sit with someone as they review this and learn how to even make it through it. It's just too much code in an area I am not an expert in for me to review with any sort of authority at the moment

Anthony Bullard (May 17 2025 at 18:57):

I've implemented a few simple type checkers, and unification once before (but not to completion). But this is a LOT

Brendan Hansknecht (May 17 2025 at 19:00):

Haha, this is an area of the compiler I tend to avoid. I don't feel like it is that complicated, but I have never felt like grokking all the type checking pieces.

Richard Feldman (May 17 2025 at 19:15):

yeah I took a bunch of notes about that PR on the plane (no wifi, couldn't comment) - overall looks good, I just want to leave some comments

Richard Feldman (May 17 2025 at 19:15):

but yeah I'm planning to merge it this weekend!

Richard Feldman (May 17 2025 at 19:16):

I also have some cache serialization stuff that's close but needs some more work

Luke Boswell (May 18 2025 at 23:14):

I've been following along with all the commits. But I also don't feel qualified to really comment on it.

Jared Ramirez (May 19 2025 at 18:03):

Yeah, sorry the PR is so huge, I know it makes it really hard to digest. I debated breaking it up, just didn’t have time. I’ll make sure to chunk things up better in the future for ease of review.

Planning on looking at Richard’s comments in detail later today, but probably will merge as-is then open a follow up PR addressing comments this week.

And I’m happy to find a time with whoever is interested (@Anthony Bullard ?) and talk through it to share the knowledge!

Anthony Bullard (May 19 2025 at 20:35):

heck yeah! thanks Jared!

Luke Boswell (May 19 2025 at 23:04):

I'd love to join that discussion too

Anthony Bullard (May 20 2025 at 12:53):

I think if no one else is working on it, I'd like to try to set up desugaring. Should I start a new topic on that?

Anthony Bullard (May 20 2025 at 12:53):

I'm going to assume so :-)

Last updated: Aug 17 2025 at 12:14 UTC