So, if we're gonna want to save module results in a cache, how should we lay it out, and where should it be? I expect we might have multiple cache stages per module, for example:
Ideally we'd cache only the artifacts from the latest stage in the pipeline, but who knows? For now, let's assume there's only one build artifact per file, but this design is expandable if there are multiple files.
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time. "File" is in quotes because the compiler ideally doesn't have to remember the path to a file, only the hash of its contents. The best way I can think to do that is to scan the list of Roc files every compilation and hash their contents, and then:
That means that we don't need to remember a relationship between filenames and their cached artifacts for user code. Unfortunately, this couldn't work with a global cache representing multiple user projects; it would only work for single-project caches, since we'd otherwise be clobbering all other cached user code on every compilation run. I can think of two solutions:
1. Have a target/ equivalent called build/ that contains all cache data per directory, as well as a build.lock file that gets created per run of a Roc compiler, and stores project cache artifacts in there. Package artifacts get compiled/saved/loaded on-demand from the global cache.
2. Save everything in a global cache with a build.lock like the first option, and we only check that folder for reading/saving/deleting build artifacts.

I'd love to hear opinions on this, but I think the second option seems better, because then all Roc artifacts can be stored in a single folder in the $HOME directory for a single machine: packages, build artifacts, and compiler versions. We'd definitely want a supplementary roc cache ... set of subcommands, with something like roc cache clean to remove old files.
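The scan-and-hash idea above (hash every Roc file's contents each compilation, then keep only the cache entries whose hashes are still "live") could look roughly like this sketch. All names here are hypothetical, and sha256 just stands in for whatever hash the compiler actually uses:

```python
import hashlib
import os

def content_hash(path):
    # Hash the file's bytes; cache entries are keyed on this,
    # not on the file's path.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def live_hashes(project_root):
    # Scan every .roc file in the project and collect the content
    # hashes that are currently "live".
    hashes = set()
    for dirpath, _dirs, files in os.walk(project_root):
        for name in files:
            if name.endswith(".roc"):
                hashes.add(content_hash(os.path.join(dirpath, name)))
    return hashes

def cull_stale_entries(cache_dir, live):
    # Delete any cached artifact whose hash no longer matches a live
    # file, enforcing roughly one cache entry per source file.
    for entry in os.listdir(cache_dir):
        if entry not in live:
            os.remove(os.path.join(cache_dir, entry))
```

This keeps the cache self-pruning per project without the compiler ever remembering filename-to-artifact mappings.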
For external packages, we'd definitely save everything in a global cache, both the source and the build artifacts. Since they'd be deterministically built given the same Roc version, we can just partition them per Roc version, and there's no need for a lock file since there's no deletion and compilation is idempotent. All in all, the second option would have this folder structure under ~/.roc/:
compiler/
    v0.1.0 (executable)
    v0.2.0
    <git hash for nightlies>
packages/
    github.com/
        lukewilliamboswell/
            roc-json/
                0.11.0/
                    z45Wzc-J39TLNweQUoLw3IGZtkQiEN3lTBv3BXErRjQ.tar.br/
                        src/
                            <source files>
                        build/
                            <roc version>/
                                <build artifacts by file hash>
        smores56/
            weaver/
                0.5.1/
                    nqyqbOkpECWgDUMbY-rG9ug883TVbOimHZFHek-bQeI.tar.br/
                        src/
                            <source files>
                        build/
                            <roc version>/
                                <build artifacts by file hash>
build/
    <hash of main.roc absolute file path>/
        build.lock
        <build artifacts by file hash>
Thoughts?
scan the list of Roc files every compilation and hash their contents
How fast is this? Would it be beneficial if you could just ask the OS for some metadata like "last edited" or something and use that to skip reading the file?
I think tools like edit time can be used to avoid some recomputations, but hashing is likely required to cut out a lot of work and skip a lot of invalidation.
The theory is that hashing the file and looking up the key in the cache is a lot faster than rerunning parsing and canonicalization.
and constraint gen!
constraint gen can also be done as a pure function of source bytes
I definitely think we should only ever cache things in the home dir
no project-local cache dir ever
one reason for this is switching branches
if I'm switching back and forth between a few different branches, my cache shouldn't be invalidated
also if I'm switching between different projects, we should be able to reuse cache from their shared dependencies
Brendan Hansknecht said:
The theory is that hashing the file and looking up the key in the cache is a lot faster than rerunning parsing and canonicalization.
Yes, but is reading the contents of the file to recompute the hash faster than looking up the hash previously computed (and unchanged, since the file hasn't been edited) and then using that to get the correct cached data?
I think we can think about speeding up the cache key determination separately
at some level we need a cache key, and source bytes are the ultimate source of truth for what we're caching here
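Luke's metadata idea above can layer on top of content hashing: remember the OS metadata alongside each hash, and only reread file bytes when the metadata changes. A minimal sketch (the persistent memo file a real compiler would use is reduced to an in-memory dict here):

```python
import hashlib
import os

# Maps path -> (mtime_ns, size, content_hash). A real compiler would
# persist this between runs; here it's just an in-memory dict.
_hash_memo = {}

def cache_key(path):
    # Return the content hash, recomputing it only when the OS metadata
    # (mtime + size) says the file may have changed.
    st = os.stat(path)
    memo = _hash_memo.get(path)
    if memo and memo[0] == st.st_mtime_ns and memo[1] == st.st_size:
        return memo[2]  # metadata unchanged: reuse the stored hash
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    _hash_memo[path] = (st.st_mtime_ns, st.st_size, digest)
    return digest
```

Source bytes stay the source of truth; the metadata check is only a shortcut to avoid rereading unchanged files.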
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time.
I guess this was the part I was thinking about... exploring ways to connect the hash and the files in a way that doesn't duplicate the cache artifacts each time the source changes (and it gets a new hash)
Though I admittedly haven't really explained any of the things I was thinking...
I think we can limit cache growth by having a "background job" that goes and deletes old cache files based on last access time
there are points where compilation gets bottlenecked and we can't productively use all the cores just because things are blocked, and during those times we can put all the idle cores on garbage-collecting old cache files until they're unblocked again
that shouldn't slow down builds, because the cores would have been idle anyway, and since all it has to do is go through and look at access times (not even read any of the contents of the files) to decide if they should be deleted, it can probably get through a lot of them very quickly
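A sketch of that garbage-collection pass, assuming access-time-based expiry and a deadline so the work yields when the build needs the cores back (note: many filesystems mount with noatime or relatime, so a real implementation might need its own last-used timestamps):

```python
import os
import time

def gc_cache(cache_dir, max_age_days=30, deadline=None):
    # Delete cache files whose last access time is older than
    # max_age_days. `deadline` (a time.monotonic() timestamp) lets the
    # caller stop early when the idle cores are needed by the build again.
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(cache_dir):
        if deadline is not None and time.monotonic() >= deadline:
            return  # build needs the core back; resume next time
        path = os.path.join(cache_dir, name)
        if os.stat(path).st_atime < cutoff:
            os.remove(path)
```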
going back to the original question, we can definitely cache typechecked modules too
basically just write down their exposed type annotations
their caches get invalidated more easily though, because if any of their dependencies' cached exposed types change, we have to recompute them
caching mono is potentially super valuable but also tricky because it's nonobvious where to cache the specializations
Ayaz and I have talked about this in the past
Unless we do two passes of type checking
Partially typecheck solo modules, then finish after combining the modules
The current plan for roc_can_solo and roc_can_combine to use the same AST should make that relatively easy
could be!
Compared to two different constraints modules
The main thing I want to do is figure out a way to prevent out-of-control cache growth by limiting each "file" to only hold one cache entry at a time.
Multiple copies in the cache is likely a good thing. It is common to work on multiple git branches that may have the same file in different states. So I don't think limiting to one entry per file is the right call
We'll have to try to know
Ideally the thing we're caching is really easy to serialize and deserialize. Preferring flat data structures to pointer-chasing, etc.
We prefer that today, it's the plan as far as I know going forward
The AST right now is very pointer-chase'y for sure.
I've also seen lots of things from deeper in the compiler that take ownership of things, etc
Oh yeah, I'm thinking about how constrain looks
Which is roughly where caching would happen
But you're right
How to cache the AST is a different question
I'm not actually sure we should, I was just using that as the example I'm most familiar with
Richard had a suggestion surrounding everything being in one big array
yeah if we're doing everything with indices into arenas (e.g. that's the idea in canonicalization) and we have 1 of those per arena, there is no deserialization step
you just read the bytes from the file into memory and you're done
it's essentially what Zig does
the downside is that everything has to be done in that one arena and with indices into it :sweat_smile:
Ahh interesting, so not even doing an SOA with a few types of arrays for different things
you can do that, but they all need to be SoA in the same arena
and then also all the metadata needs to be in the arena too, at the beginning
instead of e.g. on the stack
so it goes to disk too
ehe, reading 10 separate arrays is minimally different from reading 1 arena blob (or at least, that'd be my hypothesis)
If data can be organized more cleanly in a small number of SOA-style arrays, that might be a win
yeah but the hard part is making everything be all indices
and no pointers
means no recursive enums, for example
Nah, you have one top-level array per enum type (not enum variant - enum type)
You do end up doing indices then, but it's a more structured form of indices
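The "one flat array per enum type, with indices instead of pointers" idea can be sketched like this. Everything below (node names, the single exprs list) is illustrative, not the actual Roc AST; the point is that because child references are plain integers into one array, the whole structure can be written to disk and read back with no pointer fix-up or deserialization step:

```python
from dataclasses import dataclass

# All Expr nodes live in one flat list; "child" references are integer
# indices into that list rather than pointers.

@dataclass
class Num:
    value: int

@dataclass
class Add:
    lhs: int  # index into `exprs`
    rhs: int  # index into `exprs`

exprs = []

def push(node):
    exprs.append(node)
    return len(exprs) - 1

def eval_expr(idx):
    # Recursion over indices, not pointers: a "recursive enum"
    # without any actual self-referencing type.
    node = exprs[idx]
    if isinstance(node, Num):
        return node.value
    return eval_expr(node.lhs) + eval_expr(node.rhs)
```

For example, (1 + 2) + 3 becomes five entries in exprs, and serializing the AST is just serializing that one array.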
Also I'm interested in exploring what SIMD-ification could be done when you have data in that sort of form
unfortunately the most expensive parts of the compilation are in the backend of the compiler, and they're also the most challenging to cache
I imagine if you ignore inlining, it's a relatively simple problem
(e.g. for a dev backend)
"relatively" is doing a fair amount of work there ;)
the specializations are the hard part
Ahhh
I was thinking of:
(1) make caching work for cases where specialization is not required
(2) where strategically possible, reduce the need for specialization by masking types - e.g. for a data structure that only ever deals in pointers to a generic data type, that code doesn't actually need to be specialized on the generic type; just have a GenericPointer that you compile for
But yes, that does remove the ability to effectively cache a lot of interesting code
The sort of thing Swift does when compiling generic code into a binary
IIRC .NET will also do some even fancier things like pre-compiling a version of some machine code that's agnostic to things like field offsets, and then do "late patching" in the real offsets after those are resolved
yeah the thing I've heard about what Swift does (which I don't know the details of) is that it's good for caching and ABI stability but very technically thorny
It also doesn't allow inlining across module boundaries, which isn't ideal
Any problems with the cache structure I laid out above? This seems orthogonal enough that someone else could work on this in parallel if they wanted to.
I can say in the issue that whoever picks it up should expect discussion when they make a PR
I'm a little wary of the build.lock (I feel like I pretty regularly ran into issues with cargo's version of that for a _long_ time before they polished it up)
One thing you may have to be careful of is windows compat issues with path length
It looks like those paths can get pretty long
One thing that crossed my mind is instead of caching on the filesystem, you could use something like sqlite
If you do that on the right scope, you could make roc build --clear-cache or whatever be really fast (just deleting a handful of db files), rather than thousands of build artifacts
yeah I don't think we should need a build.lock-type thing
Should be deterministic for file output, just have to make sure that two processes don't write to the same file at the same time. Is that a problem? Seems like writing to a random file in /tmp/ would mean that we only have to move a file to the cache dir
But that's extra work that we might not have to do
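The write-then-move approach being discussed could look like the sketch below. One caveat on the /tmp/ idea: the rename is only atomic if the temp file lives on the same filesystem as the cache dir, so writing the temp file inside the cache dir itself is safer than /tmp/:

```python
import os
import tempfile

def write_cache_entry(cache_dir, key, data):
    # Write to a temp file in the same directory, then atomically
    # rename it into place. Two concurrent compilers may race, but the
    # loser's rename just replaces the file with identical bytes (the
    # output is deterministic), and no reader ever sees a partially
    # written artifact.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, os.path.join(cache_dir, key))  # atomic on POSIX
    except BaseException:
        os.remove(tmp_path)
        raise
```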
Multiple copies in the cache is likely a good thing. It is common to work on multiple git branches that may have the same file in different states. So I don't think limiting to one entry per file is the right call
@Brendan Hansknecht this is tricky, since it'd be nice to have a deterministic and "close to pure" (meaning using few parameters) means for caching. So making something that doesn't need a coordinated strategy where we read from a file that lists the last N files would be great
I agree that I am suggesting that we trade off performance for cache size here, but I think the performance improvement isn't that important here since the cached artifacts are for the fast part of the compiler anyway
So we should start with something that is easy to implement correctly (which I believe the above strategy would be), and then we can try to make this cache more files in the future
Do we want to trust that packages on the host system have not been modified? For example, if I download the code for the latest version of Weaver to a folder named ~/.roc/packages/github.com/smores56/weaver/0.5.1/nqyqbOkpECWgDUMbY-rG9ug883TVbOimHZFHek-bQeI.tar.br/src/..., I'd ideally want to keep it unarchived in our system to avoid the need to decompress the archive every time. But what if someone edited the hash in their Roc app for the package and also the hash in the global Roc cache?
This seems most likely to be self-inflicted
But the safe bet is to just save the full archive and decompress it for now, and then try to avoid the decompression cost down the road
Unless we don't care about this security concern, which would likely be because we don't think it's a real concern
Created a detailed issue for this, would appreciate a review if someone gets a chance: https://github.com/roc-lang/roc/issues/7517
I wrote it based on my current understanding of the plan, but I am happy to change the details based on the results of this discussion
I just thought it'd be good to get everything written down while it was fresh in my mind.
hm, why would we have the project directory structure in the cache dir instead of just a flat, un-namespaced collection of hashes?
in other words, instead of:
build/<project hash>/<roc version>/<file content hash>
why not
build/<roc version>/<file content hash>
also I do think we should do the "write to tmpdir and then move afterwards" thing - otherwise if multiple roc compiler processes are running at the same time, there can be race conditions around partially-written files in the cache :big_smile:
~/.roc/packages/<repository website>/<username>/<project name>/<version>/<archive hash>/...
we already have a format for these, which I think should stay as-is!
regarding decompression, I don't think there's a real security benefit to leaving the files compressed. If an attacker has gotten write access to that directory, they can just write fake cache files directly which do whatever malicious thing they want. I think it's better to leave them decompressed so we don't have to keep redoing that work!
(also regarding the comment on the issue about XDG - we already do that for our cached package downloads! There's a whole algorithm we use to determine where the roc cache dir should go.)
Sam Mohr said:
So we should start with something that is easy to implement correctly (which I believe the above strategy would be), and then we can try to make this cache more files in the future
Yes, but we need to make sure we design in ways that enable more flexibility in the future. Avoid simplifying so much that we design ourselves into a corner and need a large rewrite to enable more functionality.
Also, we probably should talk to the zig folks about this. I think they just went through tons of incremental build work. I would guess they can tell us a good bit about pitfalls
Richard Feldman said:
I definitely think we should only ever cache things in the home dir
I really dislike this. It makes it very unclear which projects are eating up disk space. I much prefer all artifacts from a single project in a single location. That is much easier to clean up and understand
that's interesting
I hadn't thought of that
hm, why would we have the project directory structure in the cache dir instead of just a flat, un-namespaced collection of hashes?
The goal with this is to know which build artifacts belong to which project on the user's system. If we have them in one big folder, then we can't just cull any build artifacts that don't correspond to current project files, because we'd delete build artifacts for all other projects.
I preferred a hash of the main.roc file because that shouldn't move much, and it keeps our build cache flatter.
We should probably offer a roc cache project-path command that would return the path to the current project's ~/.roc/build/<hash>/ path, so you can do "$(roc cache project-path)" for when you want to check the cache dir in scripts. This could be accompanied by a roc cache project-clean command; pretty easy to clean the project with that
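The path-hashing idea above (hash main.roc's absolute path, not its contents, so the key stays stable across edits and branch switches) is small enough to sketch directly. The function name and the use of sha256 are assumptions for illustration:

```python
import hashlib
import os

def project_cache_dir(roc_cache_root, main_roc_path):
    # Hypothetical `roc cache project-path` logic: the project's cache
    # dir is keyed on a hash of main.roc's absolute path (not its
    # contents), so the same project always maps to the same dir.
    abs_path = os.path.abspath(main_roc_path)
    digest = hashlib.sha256(abs_path.encode("utf-8")).hexdigest()
    return os.path.join(roc_cache_root, "build", digest)
```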
Updated the description in the issue:
Brendan Hansknecht said:
Richard Feldman said:
I definitely think we should only ever cache things in the home dir
I really dislike this. It makes it very unclear which projects are eating up disk space. I much prefer all artifacts from a single project in a single location. That is much easier to clean up and understand
the flip side of this is that it's undesirable for Roc scripts to clutter the local directory with a cache dir, but still desirable for them to have caching for repeat run perf.
I also like the idea of Roc not needing a .gitignore because by default it just doesn't create any local stuff you're supposed to ignore
what about having commands like roc cache clean and roc cache size to give you insight into that?
if you want to get that info across a bunch of projects on disk, it's probably about as much work to ask Claude to write you a shell one-liner to go run that command on all the projects as it would be to get all the sizes of one of their subdirectories (if cache dirs were local to the project)
Yeah, I think the biggest issue comes when you don't realize some old project (potentially otherwise deleted) is wasting a ton of cache space. Because you don't actually want to just delete the whole cache, and you also don't want to have to manually remember some random old project. It's easy to see that a project's directory is wasting space (I see this all the time with Rust). It would be much harder to dig into the same with central caching. Obviously tooling can make it work, but I prefer the file system to just be that tool instead of needing to learn new tooling.
Want to quickly add a +1 to the idea of Roc not putting cache files in the project directory. I think in particular for Roc to be as nice for scripting as a dedicated scripting language, a small part of that is that it doesn't create a bunch of helper files in the project directory.
For cache management, I wonder if there's some nice automated heuristics that would mean that Roc is respectful of disk space without needing active management. Something like:
Such heuristics might be easier to implement if the entire cache lives in a single place. If there's bits of Roc cache strewn about the file system, then Roc won't know where all the cache is, and so won't be able to manage the whole either. At that point manually managing the cache will be the user's only option.
If we add strategies like this, we should make them opt in.
Or at a minimum opt out
You think?
Disk space is practically free and I have seen plenty of cases where it is preferable to just eat a ton of it.
Also, doing small deletions on every cache use will be bad for performance.
Obviously some solutions can still be grouped, but I think this is why tools like nix just do a large GC call to cleanup caching. Gives the user control.
Brendan Hansknecht said:
Or at a minimum opt out
I'd vote for this. We can have sensible defaults for the common path and experience.
I imagine that I wouldn't even think about my roc cache size until it is at least 10GB. Depending on the machine, probably closer to 100GB
I think those are important aspects, but wonder if they can be folded into a "sufficiently smart" cache management strategy :sweat_smile:.
For instance, maybe the amount of disk Roc is okay using is not relative to total disk space, but to free disk space. And deletions can happen in a background job, I think Richard mentioned a daemon before.
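The free-disk-space idea above is easy to express concretely. This is a hypothetical heuristic, not anything decided; the function name and the 10% default are made up for illustration:

```python
import shutil

def cache_budget_bytes(cache_path, fraction=0.10):
    # Let the cache use at most some fraction of the disk space
    # currently free, rather than a fixed number of gigabytes. A
    # background job could then delete least-recently-used entries
    # until the cache fits within this budget.
    usage = shutil.disk_usage(cache_path)
    return int(usage.free * fraction)
```

A nearly full disk then automatically shrinks the budget, which is the "respectful of disk space without active management" behavior being discussed.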
Nix is an interesting case. I used to manually run a nix garbage-collect job periodically when it occurred to me or, more likely, when I ran into some disk-full-related errors.
At some point I enabled a "automatically garbage collect old stuff on every nixos-rebuild" and I've not actively needed to worry about Nix disk space usage since. It's been much nicer!
One idea might be to do the build, and then follow up with the review/cleanup
Jasper Woudenberg said:
I think those are important aspects, but wonder if they can be folded into a "sufficiently smart" cache management strategy :sweat_smile:.
Yeah, I totally think this is doable. There can be sane defaults and a config in the cache folder to give more control.
Richard Feldman said:
I think we can limit cache growth by having a "background job" that goes and deletes old cache files based on last access time
there are points where compilation gets bottlenecked and we can't productively use all the cores just because things are blocked, and during those times we can put all the idle cores on garbage-collecting old cache files until they're unblocked again
yeah I think this would be the nicest way to do it if it can work - no daemon, just make use of idle cores during every build to quietly go around deleting expired cache entries until either it runs out or the build needs the core again
In that case, we could get away with not needing to stick to a build/<project main.roc hash>/<roc version>/<file hash> strategy and just go for build/<roc version>/<file hash>
I still think the former is a very simple strategy that will work until we figure out our cleaning strategy
But this isn't gonna be implemented for a while, so I don't think it matters yet
I would also much prefer to have all of the caching in a single directory and avoid a per-project build dir. I think heuristics about cleaning up the cache can probably solve the problem nicely, but even if they can't, deleting the whole cache once in a blue moon if it got too large would not be that big of a deal.
The only thing that deleting the entire ~/.cache/roc/ folder would break would be the compiler being missing
Everything else would survive
Even then, if we default to storing the current Roc bin in /usr/local/bin/ or something, then it wouldn't even break
I think the compiler install should go into ~/.local/roc rather than /usr/local/bin/ or ~/.cache/roc/
The semantics of the .cache dir are supposed to be that dropping it won't cause anything to break - but if the roc compiler is stored in there and you drop it, you'll clearly have broken your workflow.
I remember there was some pushback, but this is one of the reasons I think we should not do symlinks for version switching of the roc executable itself, and instead it should update itself in-place
if it does that, then the downloads are actually just cache in case you want to switch back to that version, and there's no problem with deleting the cache dir
I think ~/.local/bin/roc. At least that's where I'd move it if it defaults to someplace else
I have my path set up to make anything put there work right away, or after a hash -r
I think XDG_STATE_DIR is where a rocup-like tool would store compiler versions
And symlink from there
yeah :point_up: is what I think we shouldn't do :sweat_smile:
It's unusual to have the compiler also be the install tool, but the idea of having a single binary for literally everything is just MAGICAL
So yeah, symlinks are only needed if we can't get that working
Otherwise, Richard's plan is just objectively better in my eyes
Would we ever want to allow different projects to pin different versions of roc?
The plan is to allow configuring the version at the top of your main.roc
I think that's something that has happened in literally every other programming language community. It's not obvious to me why Roc would not also want to support that.
The language version and compiler version are two different things though
Hmmm
It may be important to pin compiler version.
(Not sure)
Could you give an example where a properly semvered language version couldn't do that?
I'm not sure what the difference is between a compiler and a lang version
If I were developing a big app I deploy to production, I'd sure as hell want to pin the exact compiler version
Too much risk of uncontrolled compiler bugs causing havoc
Better to have compiler upgrades be like any other commit that you can revert
Rust only does a semver, seems to work okay for AWS
Because that pins a single compiler version for their release schedule
At my company we pin an exact nightly version
I suppose that may be an artifact of being stuck on nightly
That pins a more specific set of features than would be available with the "every six weeks" release
And also a more specific set of bugs. There have definitely been times when we had to choose a different version to pin because our release pipeline picked up things that turned out to be compiler bugs
But it's not like you could get a different compiler with the same semver
Yeah, that's fair
Even roc pins rust to an exact version despite semver suggesting we could use a newer version
I think it is an exceptionally common use case
Then we can just allow setting a commit hash at the top of the file or something
The version at the top should be pointable at any release we have in GitHub releases
I was assuming that we'd make all of those different semvers (including nightlies, somehow)
If that's not the case, then semver doesn't cut it, sure
one of the things in the design doc in #ideas > compiler version management is that you can pass a CLI flag to have roc run a different version of roc from your cached downloads (after downloading it first if necessary)
and also the same thing can be done in a .roc file
so no symlinking needed for that use case either! :big_smile:
Interesting. So roc still auto-updates to the latest version, but it can run old versions from the cache.
well I'd want the update to be manual/opt-in, but yeah
Richard Feldman said:
yeah :point_up: is what I think we shouldn't do :sweat_smile:
This is a pretty common pattern for tooling. And I've seen (in my own company) the "we'll defer to a different global version from a local version" approach fall on its face pretty terribly. Now maybe I need to re-read this entire thread top to bottom to see what the fears are with symlinking for this use case (I'm sure if you are making the argument, it's well reasoned)
#ideas > compiler version management is the most relevant thread on this topic! :big_smile:
So the argument is "we want Roc to only require a single executable download (directly by the user) for it to run Roc apps targeting any version"?
If so, then I agree that this makes sense. I've just never seen this work out as well as imagined here. But maybe that's because those tools were written in JavaScript, and most of the issues have more to do with node module resolution than the concept itself. As long as args, pipes, and errors are linked up correctly and the user has final discretion over changes to the filesystem, I guess it should be fine.
Do you know of another language toolchain that works like this? This is like rustup, cargo, and rustc in a single executable.
I don't know of any toolchain that works like this
but I don't think there are any technical barrier to it, just one of those "nobody has done it until someone is the first to do it" things :big_smile:
Awesome. That’s Roc innovation !
Now all we need is to have a "version" of Roc :wink:
yeah, as noted in the doc, I'd like to get this in place before 0.1.0, because that way in the future you can use it to switch your current roc all the way back to 0.1.0 without getting stuck, and still be able to go forward again
Are we going to have a strategy for compiler commit SHAs/tags as well? And local dev? I guess I can just try to read through all of this and the other thread when I’m done with work
I see the former in there, but not the latter. Obviously that is an edge case for compiler devs mostly
I mentioned nightlies in the doc
Last updated: Jul 06 2025 at 12:14 UTC