Multiple interners or single interner · compiler development

Stream: compiler development

Topic: Multiple interners or single interner

Sam Mohr (Feb 12 2025 at 07:02):

Currently the new compiler has multiple interners based roughly on the outline Agus made for the Rust-based compiler's specialize_types pass. There's an array of tag_names, field_names, idents, etc. I thought that copying this would be a good pattern to help get the type safety that we got in Rust, e.g. an Index<TagName> in Rust can only be used for looking up tag names, mirrored in the Zig rewrite with a TagName.Idx.

However, I'm not sure if this is the right move. For one, the interners for everything but string literals are deduplicated, meaning we can't deduplicate variables names and record field names with the same names, taking more space. More importantly, if we want to look up a name, we can only look for things with that intern kind, meaning we can't do constant-time lookup of a tag name using a type name. That shouldn't be too much of a problem though: the only place I think the same token represents two different interned types is name punning, AKA records like { field_name }. That just means we'd need to convert that to { field_name: field_name } during desugaring and potentially copy the name field_name from the record field interner to the ident one.

So with that in mind, I'm not seeing one option as definitely better but I'd rather have type safety than a little space conservation.

For the parsing team, would there be any problems with interning into different interners based on the token type? I presume we'd be good to do that. If so, I'll probably make a change to break into more groups of types, e.g. one for module names, one for type names, etc.

Sam Mohr (Feb 12 2025 at 07:03):

And to those outside of @Joshua Warner and @Anthony Bullard , do you think there's a strong reason to prefer one or multiple interners?

Joshua Warner (Feb 12 2025 at 07:07):

My thinking right now is to have interning happen in canonicalization, not parsing

Sam Mohr (Feb 12 2025 at 07:08):

Hmm. Anthony suggested otherwise, but it'd be okay

Joshua Warner (Feb 12 2025 at 07:09):

I don't think there's a strong benefit to doing it earlier, since the ast needs to be carrying around references/indices into the text anyway - and if that's all it's carrying around we can maybe keep the ast smaller.

Sam Mohr (Feb 12 2025 at 07:09):

The benefit of doing it during parsing is that it's an O(n) pass over the data which allows us to do mostly constant time string comparisons (outside of checking imports) and also carry around less data starting with canonicalization

Sam Mohr (Feb 12 2025 at 07:10):

I guess it's a question if dropping the full text earlier during compilation would be considered beneficial

Joshua Warner (Feb 12 2025 at 07:10):

I definitely buy constant time string comparisons as a benefit

Sam Mohr (Feb 12 2025 at 07:10):

That happens as soon as interning happens

Joshua Warner (Feb 12 2025 at 07:10):

... but I don't want the parser to have to talk to some complicated multithreaded interning system

Sam Mohr (Feb 12 2025 at 07:11):

It's not multithreaded

Joshua Warner (Feb 12 2025 at 07:11):

That's going to slow down the parser

Sam Mohr (Feb 12 2025 at 07:11):

There's an interner per module

Joshua Warner (Feb 12 2025 at 07:11):

Ahhh; ok; that's somewhat less concerning

Sam Mohr (Feb 12 2025 at 07:11):

Specifically one of these: https://github.com/roc-lang/roc/blob/main/src/base/ModuleEnv.zig

Sam Mohr (Feb 12 2025 at 07:12):

The other thought is that we could check for ident attributes and problems in the same pass as we tokenize, or at least as we parse

Joshua Warner (Feb 12 2025 at 07:12):

Still, I'd rather have more work in can than in parsing (supposing we're just shifting it around, and not making lots more/less work). That way, things like formatting can be faster (very common to do across a whole file or even whole codebase!)

Sam Mohr (Feb 12 2025 at 07:13):

Because if we intern during canonicalization, we have to read over each ident in full (not just the starting and ending chars) to check for warning info, like multiple underscores in a row or uppercase letters in a lowercase ident (we want to warn on camelCase)

Sam Mohr (Feb 12 2025 at 07:13):

You're right, this would make formatting faster...

Joshua Warner (Feb 12 2025 at 07:14):

(re)building an ast after saving a file, too

Sam Mohr (Feb 12 2025 at 07:14):

It'd be nice to have some estimate whether interning during parsing would be faster for overall compilation time

Sam Mohr (Feb 12 2025 at 07:14):

Because if it's not, then it should definitely be done during can

Sam Mohr (Feb 12 2025 at 07:15):

I only suggested it because I thought it would speed up overall compilation, even if it makes formatting slower

Joshua Warner (Feb 12 2025 at 07:15):

If anything, I'd intern during tokenization rather than parsing

Sam Mohr (Feb 12 2025 at 07:15):

Yeah, that was my thought too

Joshua Warner (Feb 12 2025 at 07:15):

That way the parser never has to look at the source

Sam Mohr (Feb 12 2025 at 07:15):

yep

Joshua Warner (Feb 12 2025 at 07:15):

(just the tokens)

Joshua Warner (Feb 12 2025 at 07:15):

I'd still like to push that later, until/unless we have some evidence to the contrary

Sam Mohr (Feb 12 2025 at 07:16):

A reasonable decision

Sam Mohr (Feb 12 2025 at 07:16):

Not sure how to get that evidence

Sam Mohr (Feb 12 2025 at 07:17):

Let me ask in a separate thread about regions if you have a plan for those

Sam Mohr (Feb 12 2025 at 07:20):

For other readers, if you have an opinion on the root question of this thread, I'd still love to hear it

Richard Feldman (Feb 12 2025 at 13:09):

if we're doing roc check, doing it during tokenization makes the most sense because the whole string will be in L1 cache right then

Richard Feldman (Feb 12 2025 at 13:09):

if we're doing roc format, not doing it at all makes the most sense :big_smile:

Richard Feldman (Feb 12 2025 at 13:10):

deduplicating string literals only comes up if we're doing roc build and not using the interpreter as the backend

Richard Feldman (Feb 12 2025 at 13:29):

I think doing it per-module instead of globally makes the most sense

Richard Feldman (Feb 12 2025 at 13:29):

it avoids contention when we get to parallelizing things, because the interns are only used by the current module

Richard Feldman (Feb 12 2025 at 13:31):

(or if they're being used by other modules it's because we're translating IDs from one module to another, but at that point they aren't being written to)

Richard Feldman (Feb 12 2025 at 13:33):

I think we'll also need to serialize them to disk as part of caching, which of course we want to do on a per-module basis instead of globally

Richard Feldman (Feb 12 2025 at 13:35):

also having them per-module makes it easier for the language server to manage their memory, since otherwise we'd need to refcount each individual interned string to know when it was safe to free, compared to just resetting a particular module's arena and redoing parsing etc

Richard Feldman (Feb 12 2025 at 13:36):

oh and also per-module means there's more cache coherence, so fewer cache misses when using interns because there aren't a bunch of gaps from indentifiers that aren't even used in the module and won't come up

Richard Feldman (Feb 12 2025 at 13:47):

as an aside, I think there are some potentially very interesting ways we can reduce memory usage in interns by packing them into size and capitalization categories and using simd to find matches, but all of that is definitely not stuff we should do for 0.1.0 :big_smile:

Joshua Warner (Feb 12 2025 at 15:10):

Ok, if I’m reading things right, the lexer for Carbon (for which I know that team has been very performance-focused), does ident interning. Here: https://github.com/carbon-language/carbon-lang/blob/trunk/toolchain/lex/lex.cpp

Joshua Warner (Feb 12 2025 at 15:10):

That’s good enough reason to me

Joshua Warner (Feb 12 2025 at 15:11):

At least we can be relatively confident it’s not a serious perf problem

Richard Feldman (Feb 12 2025 at 15:26):

yeah and in the future we can potentially have roc format tell it to use a "null interner" which just returns 0 for every ident (or something) because we know it's never going to look them up again

Sam Mohr (Feb 12 2025 at 16:23):

Richard Feldman said:

deduplicating string literals only comes up if we're doing roc build and not using the interpreter as the backend

During canonicalization, we can check if the same variable/type/tag/etc. name is used while scope checking in constant time if all strings are interned, so I think it's beneficial before interpreting too

Richard Feldman (Feb 12 2025 at 16:44):

I mean string literals specifically

Richard Feldman (Feb 12 2025 at 16:44):

like "foo" in the Roc code cli_arg_name = "foo"

Richard Feldman (Feb 12 2025 at 16:45):

as opposed to identifier names in code

Joshua Warner (Feb 12 2025 at 16:45):

Ah, do we need to unique those?

Richard Feldman (Feb 12 2025 at 16:45):

we want to deduplicate those when compiling to a static binary, so that we don't put the same string in the binary's readonly section more than once

Richard Feldman (Feb 12 2025 at 16:46):

there's no benefit to having duplicates since they're readonly

Joshua Warner (Feb 12 2025 at 16:46):

Ah true; but I would assume that happens later

Richard Feldman (Feb 12 2025 at 16:46):

yeah that's what I mean

Richard Feldman (Feb 12 2025 at 16:46):

we only need to dedupe those if we're building a binary

Richard Feldman (Feb 12 2025 at 16:46):

but check and format have no reason to do that work

Richard Feldman (Feb 12 2025 at 16:46):

(and those strings can be pretty big!)

Sam Mohr (Feb 12 2025 at 16:47):

Yep, those are not deduped right now

Sam Mohr (Feb 12 2025 at 16:47):

Because big and unique

Richard Feldman (Feb 12 2025 at 16:47):

yeah, so I don't think we should use interns for those (at least at this stage)

Richard Feldman (Feb 12 2025 at 16:47):

better to leave that for a later pass

Sam Mohr (Feb 12 2025 at 16:48):

We still put them in a big array to keep IR nodes smaller

Sam Mohr (Feb 12 2025 at 16:48):

Not sure if you think that's actively a bad idea

Richard Feldman (Feb 12 2025 at 16:48):

oh I think we can just reference original source range

Sam Mohr (Feb 12 2025 at 16:48):

Oh, that's better

Sam Mohr (Feb 12 2025 at 16:48):

I'll hook that up

Richard Feldman (Feb 12 2025 at 16:49):

that does mean the interpreter has to redo a bit of work

Richard Feldman (Feb 12 2025 at 16:49):

to do things like resolve \ escapes

Richard Feldman (Feb 12 2025 at 16:49):

which maybe isn't worth it

Richard Feldman (Feb 12 2025 at 16:50):

could always have 2 different IR nodes, one for like "literal with no escapes" that just stores a source code range and another for "escaped string literal" which points to the post-processed one

Sam Mohr (Feb 12 2025 at 16:51):

I guess that's okay complexity to eat today since we'd only need to deal with this during check

Sam Mohr (Feb 12 2025 at 16:51):

The later phases would get back a normal string

Brendan Hansknecht (Feb 12 2025 at 23:46):

Richard Feldman said:

we want to deduplicate those when compiling to a static binary, so that we don't put the same string in the binary's readonly section more than once

I think we can trivially do that when generating llvm ir. We generate a single ir module so we can literally keep a map of seen strings if we want.

Brendan Hansknecht (Feb 12 2025 at 23:48):

Richard Feldman said:

that does mean the interpreter has to redo a bit of work

I guess that would mean we have to keep all source in memory which doesn't sound good. Or the interpreter has to re-open every file to pull all the strings, which sounds kinda strange, but ok.

Last updated: Jul 26 2025 at 12:14 UTC