So, if we're gonna want to save module results in a cache, how should we lay it out, and where should it be? I expect we might have multiple cache stages per module, for example:
Ideally we'd cache only the artifacts from the latest stage in the pipeline, but who knows? For now, let's assume there's only one build artifact per file, but this design is expandable if there are multiple files.
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time. "File" is in quotes because the compiler ideally doesn't have to remember the path to a file, only the hash of its contents. The best way I can think to do that is to scan the list of Roc files every compilation and hash their contents, and then:
That means that we don't need to remember a relationship between filenames and their cached artifacts for user code. Unfortunately, this couldn't work with a global cache representing multiple user projects; it would only work for single-project caches, since we'd otherwise be clobbering all other cached user code on every compilation run. I can think of two solutions:
1. Have a target/ equivalent called build/ that contains all cache data per directory, as well as a build.lock file that gets created per run of a Roc compiler, and stores project cache artifacts in there. Package artifacts get compiled/saved/loaded on-demand from the global cache.
2. Save everything in a global cache with a build.lock like the first option, and we only check that folder for reading/saving/deleting build artifacts.

I'd love to hear opinions on this, but I think the second option seems better, because then all Roc artifacts can be stored in a single folder in the $HOME directory for a single machine: packages, build artifacts, and compiler versions. We'd definitely want a supplementary roc cache ... set of subcommands, with something like roc cache clean to remove old files.
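The scan-and-hash idea above (hash every Roc file's contents each compilation, then keep only the cache entries whose hashes are still "live") could look roughly like this sketch. All names here are hypothetical, and sha256 just stands in for whatever hash the compiler actually uses:

```python
import hashlib
import os

def content_hash(path):
    # Hash the file's bytes; cache entries are keyed on this,
    # not on the file's path.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def live_hashes(project_root):
    # Scan every .roc file in the project and collect the content
    # hashes that are currently "live".
    hashes = set()
    for dirpath, _dirs, files in os.walk(project_root):
        for name in files:
            if name.endswith(".roc"):
                hashes.add(content_hash(os.path.join(dirpath, name)))
    return hashes

def cull_stale_entries(cache_dir, live):
    # Delete any cached artifact whose hash no longer matches a live
    # file, enforcing roughly one cache entry per source file.
    for entry in os.listdir(cache_dir):
        if entry not in live:
            os.remove(os.path.join(cache_dir, entry))
```

This keeps the cache self-pruning per project without the compiler ever remembering filename-to-artifact mappings.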
For external packages, we'd definitely save everything in a global cache, both the source and the build artifacts. Since they'd be deterministically built given the same Roc version, we can just partition them per Roc version, and there's no need for a lock file since there's no deletion and compilation is idempotent. All in all, the second option would have this folder structure under ~/.roc/:
compiler/
    v0.1.0 (executable)
    v0.2.0
    <git hash for nightlies>
packages/
    github.com/
        lukewilliamboswell/
            roc-json/
                0.11.0/
                    z45Wzc-J39TLNweQUoLw3IGZtkQiEN3lTBv3BXErRjQ.tar.br/
                        src/
                            <source files>
                        build/
                            <roc version>/
                                <build artifacts by file hash>
        smores56/
            weaver/
                0.5.1/
                    nqyqbOkpECWgDUMbY-rG9ug883TVbOimHZFHek-bQeI.tar.br/
                        src/
                            <source files>
                        build/
                            <roc version>/
                                <build artifacts by file hash>
build/
    <hash of main.roc absolute file path>/
        build.lock
        <build artifacts by file hash>
Thoughts?
scan the list of Roc files every compilation and hash their contents
How fast is this? Would it be beneficial if you could just ask the OS for some metadata like "last edited" or something and use that to skip reading the file?
I think tools like edit time can be used to avoid some recomputations, but hashing is likely required to cut out a lot of work and skip a lot of invalidation.
The theory is that hashing the file and looking up the key in the cache is a lot faster than rerunning parsing and canonicalization.
and constraint gen!
constraint gen can also be done as a pure function of source bytes
I definitely think we should only ever cache things in the home dir
no project-local cache dir ever
one reason for this is switching branches
if I'm switching back and forth between a few different branches, my cache shouldn't be invalidated
also if I'm switching between different projects, we should be able to reuse cache from their shared dependencies
Brendan Hansknecht said:
The theory is that hashing the file and looking up the key in the cache is a lot faster than rerunning parsing and canonicalization.
Yes, but is reading the contents of the file to recompute the hash faster than looking up the hash previously computed (and unchanged, since the file hasn't been edited) and then using that to get the correct cached data?
I think we can think about speeding up the cache key determination separately
at some level we need a cache key, and source bytes are the ultimate source of truth for what we're caching here
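Luke's metadata idea above can layer on top of content hashing: remember the OS metadata alongside each hash, and only reread file bytes when the metadata changes. A minimal sketch (the persistent memo file a real compiler would use is reduced to an in-memory dict here):

```python
import hashlib
import os

# Maps path -> (mtime_ns, size, content_hash). A real compiler would
# persist this between runs; here it's just an in-memory dict.
_hash_memo = {}

def cache_key(path):
    # Return the content hash, recomputing it only when the OS metadata
    # (mtime + size) says the file may have changed.
    st = os.stat(path)
    memo = _hash_memo.get(path)
    if memo and memo[0] == st.st_mtime_ns and memo[1] == st.st_size:
        return memo[2]  # metadata unchanged: reuse the stored hash
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    _hash_memo[path] = (st.st_mtime_ns, st.st_size, digest)
    return digest
```

Source bytes stay the source of truth; the metadata check is only a shortcut to avoid rereading unchanged files.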
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time.
I guess this was the part I was thinking about... exploring ways to connect the hash and the files in a way that doesn't duplicate the cache artifacts each time the source changes (and it gets a new hash)
Though I admittedly haven't really explained any of the things I was thinking...
I think we can limit cache growth by having a "background job" that goes and deletes old cache files based on last access time
there are points where compilation gets bottlenecked and we can't productively use all the cores just because things are blocked, and during those times we can put all the idle cores on garbage-collecting old cache files until they're unblocked again
that shouldn't slow down builds, because the cores would have been idle anyway, and since all it has to do is go through and look at access times (not even read any of the contents of the files) to decide if they should be deleted, it can probably get through a lot of them very quickly
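A sketch of that garbage-collection pass, assuming access-time-based expiry and a deadline so the work yields when the build needs the cores back (note: many filesystems mount with noatime or relatime, so a real implementation might need its own last-used timestamps):

```python
import os
import time

def gc_cache(cache_dir, max_age_days=30, deadline=None):
    # Delete cache files whose last access time is older than
    # max_age_days. `deadline` (a time.monotonic() timestamp) lets the
    # caller stop early when the idle cores are needed by the build again.
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(cache_dir):
        if deadline is not None and time.monotonic() >= deadline:
            return  # build needs the core back; resume next time
        path = os.path.join(cache_dir, name)
        if os.stat(path).st_atime < cutoff:
            os.remove(path)
```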
going back to the original question, we can definitely cache typechecked modules too
basically just write down their exposed type annotations
their caches get invalidated more easily though, because if any of their dependencies' cached exposed types change, we have to recompute them
caching mono is potentially super valuable but also tricky because it's nonobvious where to cache the specializations
Ayaz and I have talked about this in the past
Unless we do two passes of type checking
Partially typecheck solo modules, then finish after combining the modules
The current plan for roc_can_solo and roc_can_combine to use the same AST should make that relatively easy
could be!
Compared to two different constraints modules
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time.
Multiple copies in the cache is likely a good thing. It is common to work on multiple git branches that may have the same file in different states. So I don't think limiting to one entry per file is the right call
We'll have to try to know
Ideally the thing we're caching is really easy to serialize and deserialize. Preferring flat data structures to pointer-chasing, etc.
We prefer that today, it's the plan as far as I know going forward
The AST right now is very pointer-chase'y for sure.
I've also seen lots of things from deeper in the compiler that take ownership of things, etc
Oh yeah, I'm thinking about how constrain looks
Which is roughly where caching would happen
But you're right
How to cache the AST is a different question
I'm not actually sure we should, I was just using that as the example I'm most familiar with
Richard had a suggestion surrounding everything being in one big array
yeah if we're doing everything with indices into arenas (e.g. that's the idea in canonicalization) and we have 1 of those per arena, there is no deserialization step
you just read the bytes from the file into memory and you're done
it's essentially what Zig does
the downside is that everything has to be done in that one arena and with indices into it :sweat_smile:
Ahh interesting, so not even doing an SOA with a few types of arrays for different things
you can do that, but they all need to be SoA in the same arena
and then also all the metadata needs to be in the arena too, at the beginning
instead of e.g. on the stack
so it goes to disk too
ehe, reading 10 separate arrays is minimally different from reading 1 arena blob (or at least, that'd be my hypothesis)
If data can be organized more cleanly in a small number of SOA-style arrays, that might be a win
yeah but the hard part is making everything be all indices
and no pointers
means no recursive enums, for example
Nah, you have one top-level array per enum type (not enum variant - enum type)
You do end up doing indices then, but it's a more structured form of indices
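The "one flat array per enum type, with indices instead of pointers" idea can be sketched like this. Everything below (node names, the single exprs list) is illustrative, not the actual Roc AST; the point is that because child references are plain integers into one array, the whole structure can be written to disk and read back with no pointer fix-up or deserialization step:

```python
from dataclasses import dataclass

# All Expr nodes live in one flat list; "child" references are integer
# indices into that list rather than pointers.

@dataclass
class Num:
    value: int

@dataclass
class Add:
    lhs: int  # index into `exprs`
    rhs: int  # index into `exprs`

exprs = []

def push(node):
    exprs.append(node)
    return len(exprs) - 1

def eval_expr(idx):
    # Recursion over indices, not pointers: a "recursive enum"
    # without any actual self-referencing type.
    node = exprs[idx]
    if isinstance(node, Num):
        return node.value
    return eval_expr(node.lhs) + eval_expr(node.rhs)
```

For example, (1 + 2) + 3 becomes five entries in exprs, and serializing the AST is just serializing that one array.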
Also I'm interested in exploring what SIMD-ification could be done when you have data in that sort of form
unfortunately the most expensive parts of the compilation are in the backend of the compiler, and they're also the most challenging to cache
I imagine if you ignore inlining, it's a relatively simple problem
(e.g. for a dev backend)
"relatively" is doing a fair amount of work there ;)
the specializations are the hard part
Ahhh
I was thinking of:
(1) make caching work for cases where specialization is not required
(2) where strategically possible, reduce the need for specialization by masking types - e.g. for a data structure that only ever deals in pointers to a generic data type, that code doesn't actually need to be specialized on the generic type; just have a GenericPointer that you compile for
But yes, that does remove the ability to effectively cache a lot of interesting code
The sort of thing Swift does when compiling generic code into a binary
IIRC .NET will also do some even fancier things like pre-compiling a version of some machine code that's agnostic to things like field offsets, and then do "late patching" in the real offsets after those are resolved
yeah the thing I've heard about what Swift does (which I don't know the details of) is that it's good for caching and ABI stability but very technically thorny
It also doesn't allow inlining across module boundaries, which isn't ideal
Any problems with the cache structure I laid out above? This seems orthogonal enough that someone else could work on this in parallel if they wanted to.
I can say in the issue that whoever picks it up should expect discussion when they make a PR
I'm a little wary of the build.lock (I feel like I pretty regularly ran into issues with cargo's version of that for a _long_ time before they polished it up)
One thing you may have to be careful of is windows compat issues with path length
It looks like those paths can get pretty long
One thing that crossed my mind is instead of caching on the filesystem, you could use something like sqlite
If you do that on the right scope, you could make roc build --clear-cache or whatever be really fast (just deleting a handful of db files), rather than thousands of build artifacts
yeah I don't think we should need a build.lock-type thing
Should be deterministic for file output, just have to make sure that two processes don't write to the same file at the same time. Is that a problem? Seems like writing to a random file in /tmp/ would mean that we only have to move a file to the cache dir
But that's extra work that we might not have to do
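The write-then-move approach being discussed could look like the sketch below. One caveat on the /tmp/ idea: the rename is only atomic if the temp file lives on the same filesystem as the cache dir, so writing the temp file inside the cache dir itself is safer than /tmp/:

```python
import os
import tempfile

def write_cache_entry(cache_dir, key, data):
    # Write to a temp file in the same directory, then atomically
    # rename it into place. Two concurrent compilers may race, but the
    # loser's rename just replaces the file with identical bytes (the
    # output is deterministic), and no reader ever sees a partially
    # written artifact.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, os.path.join(cache_dir, key))  # atomic on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```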
Multiple copies in the cache is likely a good thing. It is common to work on multiple git branches that may have the same file in different states. So I don't think limiting to one entry per file is the right call
@Brendan Hansknecht this is tricky, since it'd be nice to have a deterministic and "close to pure" (meaning using few parameters) means for caching. So making something that doesn't need a coordinated strategy where we read from a file that lists the last N files would be great
I agree that I am suggesting that we trade off performance for cache size here, but I think the performance improvement isn't that important here since the cached artifacts are for the fast part of the compiler anyway
So we should start with something that is easy to implement correctly (which I believe the above strategy would be), and then we can try to make this cache more files in the future
Do we want to trust that packages on the host system have not been modified? For example, if I download the code for the latest version of Weaver to a folder named ~/.roc/packages/github.com/smores56/weaver/0.5.1/nqyqbOkpECWgDUMbY-rG9ug883TVbOimHZFHek-bQeI.tar.br/src/..., I'd ideally want to keep it unarchived in our system to avoid the need to decompress the archive every time. But what if someone edited the hash in their Roc app for the package and also the hash in the global Roc cache?
This seems most likely to be self-inflicted
But the safe bet is to just save the full archive and decompress it for now, and then try to avoid the decompression cost down the road
Unless we don't care about this security concern, which would likely be because we don't think it's a real concern
Created a detailed issue for this, would appreciate a review if someone gets a chance: https://github.com/roc-lang/roc/issues/7517
I wrote it based on my current understanding of the plan, but I am happy to change the details based on the results of this discussion
I just thought it'd be good to get everything written down while it was fresh in my mind.
hm, why would we have the project directory structure in the cache dir instead of just a flat, un-namespaced collection of hashes?
in other words, instead of:
build/<project hash>/<roc version>/<file content hash>
why not
build/<roc version>/<file content hash>
also I do think we should do the "write to tmpdir and then move afterwards" thing - otherwise if multiple roc compiler processes are running at the same time, there can be race conditions around partially-written files in the cache :big_smile:
~/.roc/packages/<repository website>/<username>/<project name>/<version>/<archive hash>/...
we already have a format for these, which I think should stay as-is!
regarding decompression, I don't think there's a real security benefit to leaving the files compressed. If an attacker has gotten write access to that directory, they can just write fake cache files directly which do whatever malicious thing they want. I think it's better to leave them decompressed so we don't have to keep redoing that work!
(also regarding the comment on the issue about XDG - we already do that for our cached package downloads! There's a whole algorithm we use to determine where the roc cache dir should go.)
Sam Mohr said:
So we should start with something that is easy to implement correctly (which I believe the above strategy would be), and then we can try to make this cache more files in the future
Yes, but we need to make sure we design in ways that enable more flexibility in the future. Avoid simplifying so much that we design ourselves into a corner and need a large rewrite to enable more functionality.
Also, we probably should talk to the zig folks about this. I think they just went through tons of incremental build work. I would guess they can tell us a good bit about pitfalls
Richard Feldman said:
I definitely think we should only ever cache things in the home dir
I really dislike this. It makes it very unclear which projects are eating up disk space. I much prefer all artifacts from a single project in a single location. That is much easier to clean up and understand
that's interesting
I hadn't thought of that
hm, why would we have the project directory structure in the cache dir instead of just a flat, un-namespaced collection of hashes?
The goal with this is to know which build artifacts belong to which project on the user's system. If we have them in one big folder, then we can't just cull any build artifacts that don't correspond to current project files, because we'd delete build artifacts for all other projects.
I preferred a hash of the main.roc file because that shouldn't move much, and it keeps our build cache flatter.
We should probably offer a roc cache project-path command that would return the path to the current project's ~/.roc/build/<hash>/ path, so you can do "$(roc cache project-path)" for when you want to check the cache dir in scripts. This could be accompanied by a roc cache project-clean command; pretty easy to clean the project with that
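The path-hashing idea above (hash main.roc's absolute path, not its contents, so the key stays stable across edits and branch switches) is small enough to sketch directly. The function name and the use of sha256 are assumptions for illustration:

```python
import hashlib
import os

def project_cache_dir(roc_cache_root, main_roc_path):
    # Hypothetical `roc cache project-path` logic: the project's cache
    # dir is keyed on a hash of main.roc's absolute path (not its
    # contents), so the same project always maps to the same dir.
    abs_path = os.path.abspath(main_roc_path)
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()
    return os.path.join(roc_cache_root, "build", digest)
```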
Updated the description in the issue:
Brendan Hansknecht said:
Richard Feldman said:
I definitely think we should only ever cache things in the home dir
I really dislike this. It makes it very unclear which projects are eating up disk space. I much prefer all artifacts from a single project in a single location. That is much easier to clean up and understand
the flip side of this is that it's undesirable for Roc scripts to clutter the local directory with a cache dir, but still desirable for them to have caching for repeat run perf.
I also like the idea of Roc not needing a .gitignore because by default it just doesn't create any local stuff you're supposed to ignore
what about having commands like roc cache clean and roc cache size to give you insight into that?
if you want to get that info across a bunch of projects on disk, it's probably about as much work to ask Claude to write you a shell one-liner to go run that command on all the projects as it would be to get all the sizes of one of their subdirectories (if cache dirs were local to the project)
Yeah, I think the biggest issue comes when you don't realize some old project (potentially otherwise deleted) is wasting a ton of cache space. Because you don't actually want to just delete the whole cache, and you also don't want to have to manually remember some random old project. It's easy to see that a project's directory is wasting space (I see this all the time with Rust). It would be much harder to dig into the same with central caching. Obviously tooling can make it work, but I prefer the file system to just be that tool instead of needing to learn new tooling.
Want to quickly add a +1 to the idea of Roc not putting cache files in the project directory. I think in particular for Roc to be as nice for scripting as a dedicated scripting language, a small part of that is that it doesn't create a bunch of helper files in the project directory.
For cache management, I wonder if there's some nice automated heuristics that would mean that Roc is respectful of disk space without needing active management. Something like:
Such heuristics might be easier to implement if the entire cache lives in a single place. If there's bits of Roc cache strewn about the file system, then Roc won't know where all the cache is, and so won't be able to manage the whole either. At that point manually managing the cache will be the user's only option.
If we add strategies like this, we should make them opt in.
Or at a minimum opt out
You think?
Disk space is practically free and I have seen plenty of cases where it is preferable to just eat a ton of it.
Also, doing small deletions on every cache use will be bad for performance.
Obviously some solutions can still be grouped, but I think this is why tools like nix just do a large GC call to cleanup caching. Gives the user control.
Brendan Hansknecht said:
Or at a minimum opt out
I'd vote for this. We can have sensible defaults for the common path and experience.
I imagine that I wouldn't even think about my roc cache size until it is at least 10GB. Depending on the machine, probably closer to 100GB
I think those are important aspects, but wonder if they can be folded into a "sufficiently smart" cache management strategy :sweat_smile:.
For instance, maybe the amount of disk Roc is okay using is not relative to total disk space, but to free disk space. And deletions can happen in a background job, I think Richard mentioned a daemon before.
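The free-disk-space idea above is easy to express concretely. This is a hypothetical heuristic, not anything decided; the function name and the 10% default are made up for illustration:

```python
import shutil

def cache_budget_bytes(cache_path, fraction=0.10):
    # Let the cache use at most some fraction of the disk space
    # currently free, rather than a fixed number of gigabytes. A
    # background job could then delete least-recently-used entries
    # until the cache fits within this budget.
    usage = shutil.disk_usage(cache_path)
    return int(usage.free * fraction)
```

A nearly full disk then automatically shrinks the budget, which is the "respectful of disk space without active management" behavior being discussed.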
Nix is an interesting case. I used to manually run a nix garbage-collect job periodically when it occurred to me or, more likely, when I ran into some disk-full-related errors.
At some point I enabled a "automatically garbage collect old stuff on every nixos-rebuild" and I've not actively needed to worry about Nix disk space usage since. It's been much nicer!
One idea might be to do the build, and then follow up with the review/cleanup
Jasper Woudenberg said:
I think those are important aspects, but wonder if they can be folded into a "sufficiently smart" cache management strategy :sweat_smile:.
Yeah, I totally think this is doable. There can be sane defaults and a config in the cache folder to give more control.
Richard Feldman said:
I think we can limit cache growth by having a "background job" that goes and deletes old cache files based on last access time
there are points where compilation gets bottlenecked and we can't productively use all the cores just because things are blocked, and during those times we can put all the idle cores on garbage-collecting old cache files until they're unblocked again
yeah I think this would be the nicest way to do it if it can work - no daemon, just make use of idle cores during every build to quietly go around deleting expired cache entries until either it runs out or the build needs the core again
In that case, we could get away with not needing to stick to a build/<project main.roc hash>/<roc version>/<file hash> strategy and just go for build/<roc version>/<file hash>
I still think the former is a very simple strategy that will work until we figure out our cleaning strategy
But this isn't gonna be implemented for a while, so I don't think it matters yet
I would also much prefer to have all of the caching in a single directory and avoid a per-project build dir. I think heuristics about cleaning up the cache can probably solve the problem nicely, but even if they can't, deleting the whole cache once in a blue moon if it got too large would not be that big of a deal.
The only thing that deleting the entire ~/.cache/roc/ folder would break would be the compiler being missing
Everything else would survive
Even then, if we default to storing the current Roc bin in /usr/local/bin/ or something, then it wouldn't even break
I think the compiler install should go into ~/.local/roc rather than /usr/local/bin/ or ~/.cache/roc/
The semantics of the .cache dir are supposed to be that dropping it won't cause anything to break - but if the roc compiler is stored in there and you drop it, you'll clearly have broken your workflow.
I remember there was some pushback, but this is one of the reasons I think we should not do symlinks for version switching of the roc executable itself, and instead it should update itself in-place
if it does that, then the downloads are actually just cache in case you want to switch back to that version, and there's no problem with deleting the cache dir
I think ~/.local/bin/roc. At least that's where I'd move it if it defaults to someplace else
I have my path set up to make anything put there work right away, or after a hash -r
I think XDG_STATE_DIR is where a rocup-like tool would store compiler versions
And symlink from there
yeah :point_up: is what I think we shouldn't do :sweat_smile:
It's unusual to have the compiler also be the install tool, but the idea of having a single binary for literally everything is just MAGICAL
So yeah, symlinks are only needed if we can't get that working
Otherwise, Richard's plan is just objectively better in my eyes
Would we ever want to allow different projects to pin different versions of roc?
The plan is to allow configuring the version at the top of your main.roc
I think that's something that has happened in literally every other programming language community. It's not obvious to me why Roc would not also want to support that.
The language version and compiler version are two different things though
Hmmm
It may be important to pin compiler version.
(Not sure)
Could you give an example where a properly semvered language version couldn't do that?
I'm not sure what the difference is between a compiler and a lang version
If I were developing a big app I deploy to production, I'd sure as hell want to pin the exact compiler version
Too much risk of uncontrolled compiler bugs causing havoc
Better to have compiler upgrades be like any other commit that you can revert
Rust only does a semver, seems to work okay for AWS
Because that pins a single compiler version for their release schedule
At my company we pin an exact nightly version
I suppose that may be an artifact of being stuck on nightly
That pins a more specific set of features than would be available with the "every six weeks" release
And also a more specific set of bugs. There have definitely been times when we had to choose a different version to pin because our release pipeline picked up things that turned out to be compiler bugs
But it's not like you could get a different compiler with the same semver
Yeah, that's fair
Even roc pins rust to an exact version despite semver suggesting we could use a newer version
I think it is an exceptionally common use case
Then we can just allow setting a commit hash at the top of the file or something
The version at the top should be pointable at any release we have in GitHub releases
I was assuming that we'd make all of those different semvers (including nightlies, somehow)
If that's not the case, then semver doesn't cut it, sure
one of the things in the design doc in #ideas > compiler version management is that you can pass a CLI flag to have roc run a different version of roc from your cached downloads (after downloading it first if necessary)
and also the same thing can be done in a .roc file
so no symlinking needed for that use case either! :big_smile:
Interesting. So roc still auto-updates to the latest version, but it can run old versions from the cache.
well I'd want the update to be manual/opt-in, but yeah
Richard Feldman said:
yeah :point_up: is what I think we shouldn't do :sweat_smile:
This is a pretty common pattern for tooling. And I've seen (in my own company) the "we'll defer to a different global version from a local version" approach fall on its face pretty terribly. Now maybe I need to re-read this entire thread top to bottom to see what the fears are with symlinking for this use case (I'm sure if you are making the argument, it's well reasoned)
#ideas > compiler version management is the most relevant thread on this topic! :big_smile:
So the argument is "we want Roc to only require a single executable download (directly by the user) for it to run Roc apps targeting any version"?
If so, then I agree that this makes sense. I've just never seen this work out as well as imagined here. But maybe that's because those tools were written in JavaScript, and most of the issues have more to do with node module resolution than the concept itself. As long as args, pipes, and errors are linked up correctly and the user has final discretion over changes to the filesystem, I guess it should be fine.
Do you know of another language toolchain that works like this? This is like rustup, cargo, and rustc in a single executable.
I don't know of any toolchain that works like this
but I don't think there are any technical barrier to it, just one of those "nobody has done it until someone is the first to do it" things :big_smile:
Awesome. That’s Roc innovation !
Now all we need is to have a "version" of Roc :wink:
yeah, as noted in the doc, I'd like to get this in place before 0.1.0, because that way in the future you can use it to switch your current roc all the way back to 0.1.0 without getting stuck, and still be able to go forward again
Are we going to have a strategy for compiler commit SHAs/tags as well? And local dev? I guess I can just try to read through all of this and the other thread when I’m done with work
I see the former in there, but not the latter. Obviously that is an edge case for compiler devs mostly
I mentioned nightlies in the doc
Last updated: Jul 06 2025 at 12:14 UTC