Currently the new compiler has multiple interners based roughly on the outline Agus made for the Rust-based compiler's `specialize_types` pass. There's an array of `tag_names`, `field_names`, `idents`, etc. I thought that copying this would be a good pattern to help get the type safety that we got in Rust, e.g. an `Index<TagName>` in Rust can only be used for looking up tag names, mirrored in the Zig rewrite with a `TagName.Idx`.
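To make the pattern concrete, a per-kind interner with a distinct index type could look roughly like this in the Zig rewrite (a minimal sketch with made-up names like `TagName.Store`, assuming Zig 0.14-era std APIs; not the actual compiler code):

```zig
const std = @import("std");

pub const TagName = struct {
    // Distinct index type: a TagName.Idx can only be used with the tag-name
    // interner, mirroring Rust's Index<TagName>.
    pub const Idx = enum(u32) { _ };

    pub const Store = struct {
        strings: std.ArrayList([]const u8),
        lookup: std.StringHashMap(Idx),

        pub fn init(gpa: std.mem.Allocator) Store {
            return .{
                .strings = std.ArrayList([]const u8).init(gpa),
                .lookup = std.StringHashMap(Idx).init(gpa),
            };
        }

        // Deduplicating insert: returns the existing index if this exact
        // text was interned before.
        pub fn insert(self: *Store, name: []const u8) !Idx {
            if (self.lookup.get(name)) |idx| return idx;
            const idx: Idx = @enumFromInt(self.strings.items.len);
            try self.strings.append(name);
            try self.lookup.put(name, idx);
            return idx;
        }

        // Lookup only accepts a TagName.Idx, never an ident or field-name index.
        pub fn get(self: *const Store, idx: Idx) []const u8 {
            return self.strings.items[@intFromEnum(idx)];
        }
    };
};
```

An index from a different interner simply wouldn't type-check in `get`, which is the safety property being mirrored from Rust's `Index<TagName>`.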
However, I'm not sure if this is the right move. For one, each interner except the string literal one deduplicates only within its own kind, meaning we can't deduplicate variable names and record field names that share the same text, which takes more space. More importantly, if we want to look up a name, we can only look among things of that intern kind, meaning we can't do constant-time lookup of a tag name using a type name. That shouldn't be too much of a problem, though: the only place I think the same token represents two different interned types is name punning, AKA records like `{ field_name }`. That just means we'd need to convert that to `{ field_name: field_name }` during desugaring and potentially copy the name `field_name` from the record field interner to the ident one.
So with that in mind, I'm not seeing one option as definitely better, but I'd rather have type safety than a little space conservation.
For the parsing team, would there be any problems with interning into different interners based on the token type? I presume we'd be good to do that. If so, I'll probably make a change to break into more groups of types, e.g. one for module names, one for type names, etc.
And to those outside of @Joshua Warner and @Anthony Bullard, do you think there's a strong reason to prefer one or multiple interners?
My thinking right now is to have interning happen in canonicalization, not parsing
Hmm. Anthony suggested otherwise, but it'd be okay
I don't think there's a strong benefit to doing it earlier, since the ast needs to be carrying around references/indices into the text anyway - and if that's all it's carrying around we can maybe keep the ast smaller.
The benefit of doing it during parsing is that it's an O(n) pass over the data which allows us to do mostly constant time string comparisons (outside of checking imports) and also carry around less data starting with canonicalization
I guess it's a question if dropping the full text earlier during compilation would be considered beneficial
I definitely buy constant time string comparisons as a benefit
That happens as soon as interning happens
... but I don't want the parser to have to talk to some complicated multithreaded interning system
It's not multithreaded
That's going to slow down the parser
There's an interner per module
Ahhh; ok; that's somewhat less concerning
Specifically one of these: https://github.com/roc-lang/roc/blob/main/src/base/ModuleEnv.zig
The other thought is that we could check for ident attributes and problems in the same pass as we tokenize, or at least as we parse
Still, I'd rather have more work in can than in parsing (supposing we're just shifting it around, and not making lots more/less work). That way, things like formatting can be faster (very common to do across a whole file or even whole codebase!)
Because if we intern during canonicalization, we have to read over each ident in full (not just the starting and ending chars) to check for warning info, like multiple underscores in a row or uppercase letters in a lowercase ident (we want to warn on camelCase)
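For example, those checks could piggyback on whichever pass is already reading every byte of the ident, roughly like this (a sketch; the warning names and function are invented):

```zig
const std = @import("std");

// Hypothetical warning kinds, for illustration only.
const IdentWarning = enum { double_underscore, uppercase_in_lowercase_ident };

// Sketch of the per-character checks being described: the tokenizer is already
// reading every byte of the ident, so it can flag these as it scans instead of
// re-reading the text in a later pass.
fn checkLowercaseIdent(text: []const u8, warnings: *std.ArrayList(IdentWarning)) !void {
    var prev_was_underscore = false;
    for (text) |c| {
        if (c == '_') {
            if (prev_was_underscore) try warnings.append(.double_underscore);
            prev_was_underscore = true;
        } else {
            prev_was_underscore = false;
            if (std.ascii.isUpper(c)) try warnings.append(.uppercase_in_lowercase_ident);
        }
    }
}
```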
You're right, this would make formatting faster...
(re)building an ast after saving a file, too
It'd be nice to have some estimate whether interning during parsing would be faster for overall compilation time
Because if it's not, then it should definitely be done during can
I only suggested it because I thought it would speed up overall compilation, even if it makes formatting slower
If anything, I'd intern during tokenization rather than parsing
Yeah, that was my thought too
That way the parser never has to look at the source
yep
(just the tokens)
I'd still like to push that later, until/unless we have some evidence to the contrary
A reasonable decision
Not sure how to get that evidence
Let me ask in a separate thread about regions if you have a plan for those
For other readers, if you have an opinion on the root question of this thread, I'd still love to hear it
if we're doing `roc check`, doing it during tokenization makes the most sense because the whole string will be in L1 cache right then
if we're doing `roc format`, not doing it at all makes the most sense :big_smile:
deduplicating string literals only comes up if we're doing `roc build` and not using the interpreter as the backend
I think doing it per-module instead of globally makes the most sense
it avoids contention when we get to parallelizing things, because the interns are only used by the current module
(or if they're being used by other modules it's because we're translating IDs from one module to another, but at that point they aren't being written to)
I think we'll also need to serialize them to disk as part of caching, which of course we want to do on a per-module basis instead of globally
also having them per-module makes it easier for the language server to manage their memory, since otherwise we'd need to refcount each individual interned string to know when it was safe to free, compared to just resetting a particular module's arena and redoing parsing etc
oh and also per-module means there's more cache coherence, so fewer cache misses when using interns because there aren't a bunch of gaps from identifiers that aren't even used in the module and won't come up
as an aside, I think there are some potentially very interesting ways we can reduce memory usage in interns by packing them into size and capitalization categories and using simd to find matches, but all of that is definitely not stuff we should do for 0.1.0 :big_smile:
Ok, if I'm reading things right, the lexer for Carbon (a team I know has been very performance-focused) does ident interning. Here: https://github.com/carbon-language/carbon-lang/blob/trunk/toolchain/lex/lex.cpp
That’s good enough reason to me
At least we can be relatively confident it’s not a serious perf problem
yeah and in the future we can potentially have `roc format` tell it to use a "null interner" which just returns 0 for every ident (or something) because we know it's never going to look them up again
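Hypothetically, that could be as trivial as this (names invented; the point is just that `insert` does no hashing or storage):

```zig
// Hypothetical sketch of a "null interner": `roc format` never needs to look
// idents up again, so every insert hands back the same dummy index and skips
// hashing and storage entirely.
const Ident = struct {
    pub const Idx = enum(u32) { _ };
};

const NullIdentInterner = struct {
    // Accept the text, do nothing with it, and return index 0.
    pub fn insert(self: *NullIdentInterner, text: []const u8) Ident.Idx {
        _ = self;
        _ = text;
        return @enumFromInt(0);
    }
};
```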
Richard Feldman said:
deduplicating string literals only comes up if we're doing `roc build` and not using the interpreter as the backend
If all strings are interned, then during canonicalization we can check in constant time while scope checking whether the same variable/type/tag/etc. name is already in use, so I think it's beneficial before interpreting too
I mean string literals specifically
like `"foo"` in the Roc code `cli_arg_name = "foo"`, as opposed to identifier names in code
Ah, do we need to unique those?
we want to deduplicate those when compiling to a static binary, so that we don't put the same string in the binary's readonly section more than once
there's no benefit to having duplicates since they're readonly
Ah true; but I would assume that happens later
yeah that's what I mean
we only need to dedupe those if we're building a binary
but `check` and `format` have no reason to do that work
(and those strings can be pretty big!)
Yep, those are not deduped right now
Because big and unique
yeah, so I don't think we should use interns for those (at least at this stage)
better to leave that for a later pass
We still put them in a big array to keep IR nodes smaller
Not sure if you think that's actively a bad idea
oh I think we can just reference original source range
Oh, that's better
I'll hook that up
that does mean the interpreter has to redo a bit of work
to do things like resolve `\` escapes
which maybe isn't worth it
could always have 2 different IR nodes, one for like "literal with no escapes" that just stores a source code range and another for "escaped string literal" which points to the post-processed one
I guess that's okay complexity to eat today since we'd only need to deal with this during check
The later phases would get back a normal string
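That two-node split could look roughly like this (a sketch with invented names, not the actual IR):

```zig
// Hypothetical sketch of two string-literal IR nodes: one that points straight
// back at the source text because it contains no escapes, and one that points
// at a post-processed copy with the `\` escapes resolved.
const Region = struct { start: u32, end: u32 };

const StrLiteral = union(enum) {
    // No escapes: the source bytes are exactly the string's contents,
    // so `roc check` never has to copy or process them.
    plain: Region,
    // Contains escapes: index into a side table of post-processed strings.
    escaped: u32,
};
```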
Richard Feldman said:
we want to deduplicate those when compiling to a static binary, so that we don't put the same string in the binary's readonly section more than once
I think we can trivially do that when generating llvm ir. We generate a single ir module so we can literally keep a map of seen strings if we want.
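For illustration, that map could be as simple as the following (a hypothetical sketch; `StringGlobals` is invented and the actual LLVM global emission is elided):

```zig
const std = @import("std");

// Hypothetical sketch: while emitting IR for the single output module, keep a
// map from string contents to the global already emitted for them, so
// identical literals share one entry in the readonly section.
const StringGlobals = struct {
    seen: std.StringHashMap(u32),
    next_id: u32 = 0,

    fn init(gpa: std.mem.Allocator) StringGlobals {
        return .{ .seen = std.StringHashMap(u32).init(gpa) };
    }

    // Returns the id of the global holding `bytes`, emitting it only the first time.
    fn globalFor(self: *StringGlobals, bytes: []const u8) !u32 {
        const entry = try self.seen.getOrPut(bytes);
        if (!entry.found_existing) {
            entry.value_ptr.* = self.next_id;
            self.next_id += 1;
            // ...emit the actual LLVM global for `bytes` here...
        }
        return entry.value_ptr.*;
    }
};
```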
Richard Feldman said:
that does mean the interpreter has to redo a bit of work
I guess that would mean we have to keep all source in memory, which doesn't sound good. Or the interpreter has to re-open every file to pull all the strings, which sounds kinda strange, but ok.