small interned strings · ideas · Zulip Chat Archive

Stream: ideas

Topic: small interned strings

Richard Feldman (Jul 27 2025 at 22:26):

@Joshua Warner continuing that topic we'd discussed briefly in the meetup, of storing small strings inline in the 4B interned IDs, here's a proposal for how that format could look:

we have 4 bytes to work with
if the first byte is negative, then this whole thing is actually an index into the interned strings (in other words, if you drop the negation bit, you get exactly what we have today except we can only store ~2 billion strings per module instead of ~4 billion, which is fine)
if all 4 bytes are in the nonzero ASCII range (so, 1-127), then it's just a 4-byte ASCII ident stored inline
we use the sign bit of the other 3 bytes for our 3 attribute flags (prefix _, suffix _, and suffix ! - any combination of these can be set; if all 3 are set, you get e.g. _blah!_ which would be an unused var storing an effectful function)
we can store ASCII idents of length 1-3 by storing trailing zeros (as in, they're zero after clearing the attribute bit) - at which point we can determine the length of the string in 2 instructions by doing a "count trailing zeros" and then dividing by 8 (so, bit shifting by 3)
all of this still leaves us with 4 zero bytes never coming up, so we can use that as a sentinel if we like

Richard Feldman (Jul 27 2025 at 22:28):

so that means we can store _blah!_ inline in 4 bytes, we can still store a little over 50% of the total number of interned strings compared to before, and all we're paying are a few extra bit twiddling ops

Richard Feldman (Jul 27 2025 at 22:28):

plus probably some branch mispredictions here and there, but probably a lot of things can be made branchless

Brendan Hansknecht (Jul 28 2025 at 14:39):

I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.

But overall seems reasonable.

Anton (Jul 28 2025 at 14:42):

I wonder if this is worth the complexity currently.

:plus:

Richard Feldman (Jul 28 2025 at 14:48):

Josh did a histogram of symbols on a test code base and it seems like this optimization would apply to more than half of them, so a pretty serious optimization :sweat_smile:

Richard Feldman (Jul 28 2025 at 14:48):

also the original motivation was that string interning was taking up most of the time of the tokenization phase

Joshua Warner (Jul 28 2025 at 15:25):

Not _most_ of the time - about 25%, but it is the single biggest chunk of time currently.

Joshua Warner (Jul 28 2025 at 15:27):

You could pack even more bits into there with some trickery :P

Richard Feldman (Jul 28 2025 at 15:40):

what would we use the extra bits for?

Joshua Warner (Jul 28 2025 at 15:49):

Storing longer identifiers of course!

Richard Feldman (Jul 28 2025 at 16:07):

:thinking: example?

Joshua Warner (Jul 28 2025 at 16:08):

Well, if we're sticking with ascii identifiers, there are < 64 valid chars in that bunch, and so we could pack 5-byte identifiers into 32 bits

Joshua Warner (Jul 28 2025 at 16:09):

Or even use a static huffman table :stuck_out_tongue_closed_eyes:

Joshua Warner (Jul 28 2025 at 16:35):

With a static huffman table, ~74% of identifiers could be compressed into 28 bits or less (leaving 3 bits for var/unused/etc, and 1 bit to mark when the identifier is in the table).

Richard Feldman (Jul 28 2025 at 16:35):

oh I see, so it would be like:

const Idx = packed struct(u32) {
    is_inline: bool,
    attributes: enum(u2) { unused, variable, fx, unused_fx },
    first_char: u5,
    other_chars: [4]u6,
}

u6 fits 64 chars, which is exactly enough for (26 lowercase + 26 uppercase + 10 digits + underscore + bang)

interestingly, the first char can fit in u5 (up to 32 possibilities) because the first character of an identifier can't be a digit (or else it would be a number literal), we'd store ! or _ in the attributes, and we can know if it's uppercase vs lowercase from context so we can just always store it as lowercase - so we only need 26 of the 32 possibilities.

this is interesting because actually it reveals an optimization I'd never thought of: store interned identifier strings as always being lowercase, because that way we can save storing e.g. both Serializer as well as serializer as separate interned strings; instead, we store both as the same string, they get the same ID, but because we always know from context whether we're expecting lowercase vs uppercase strings, we can always convert back to the appropriate one for display

Richard Feldman (Jul 28 2025 at 16:38):

also I realized that for attributes only those 4 combinations actually need to be stored; e.g. an unused var would be a warning that it's var for no reason since you never reassign it (and if you reassign it, it's no longer unused), and making a reassignable effectful function would probably come up ~never in practice, so no need to give it a speecial case

Richard Feldman (Jul 28 2025 at 16:39):

what I like about that vs a static huffman table is that we don't need to pay for extra memory for lookups; this all gets decoded using bit shifts etc in the memory we're already loading

Richard Feldman (Jul 28 2025 at 16:40):

that's really cool :smiley:

Joshua Warner (Jul 28 2025 at 16:42):

I don't think decoding will be common. That's mostly for debugging tooling and error messaging.

Richard Feldman (Jul 28 2025 at 16:43):

oh because we usually show it using Region instead? :thinking:

Joshua Warner (Jul 28 2025 at 16:44):

Yes, and even then that's only really for emitting error messages

Joshua Warner (Jul 28 2025 at 16:44):

And I guess things like symbol names in the binary :thinking:

Joshua Warner (Jul 28 2025 at 16:45):

(I don't think we should heavily optimize the error messaging code path - we should cap that at 100 per build or something, and it'll just never be an issue)

Joshua Warner (Jul 28 2025 at 16:48):

All this said - I'm not sure how far it's worth going down this road (yet), since it limits some potential future optimizations.

Joshua Warner (Jul 28 2025 at 16:50):

e.g. Carbon uses dense identifier indices that can be reasonably packed into 23 bits and stored _inline_ in the token representation.

Richard Feldman (Jul 28 2025 at 16:50):

how does that work?

Joshua Warner (Jul 28 2025 at 16:51):

They use 8 bits token kind, 1 bit "whitespace before", 23 bits "extra" (used for identifier id, literal id, etc depending on token type), and 32 bits for the byte offset in the file.

Joshua Warner (Jul 28 2025 at 16:53):

23 bits (~8 million) is reasonable when you're talking about a limit on unique identifiers

Richard Feldman (Jul 28 2025 at 17:02):

do we just store string literals as regions? or do we intern those too?

Joshua Warner (Jul 28 2025 at 17:04):

We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).

Richard Feldman (Jul 28 2025 at 17:06):

so we just use the region?

Joshua Warner (Jul 28 2025 at 17:06):

Yes

Richard Feldman (Jul 28 2025 at 17:07):

cool, that design seems good to me!

Joshua Warner (Jul 28 2025 at 17:10):

Semi-related question: would _MyUnusedFakeLowerIdent be considered lowercase? uppercase? error?

Richard Feldman (Jul 28 2025 at 17:13):

good question - I think uppercase, basically "unused tag"

Richard Feldman (Jul 28 2025 at 17:14):

in a hypothetical future where we give warnings for unused tags (e.g. I put it in my type alias or nominal type but I never actually construct it anywhere), this would be a way to be like "yeah I know but I still want to put it in here as a reminder or something"

Joshua Warner (Jul 28 2025 at 17:14):

"yeah I know but I still want to put it in here as a reminder or something"

Sounds like a great use case for a comment!

Joshua Warner (Jul 28 2025 at 17:15):

FWIW these are currently treated as NamedUnderscore which is basically always treated as if it were lowercase in the parser.

Austin Davis (Jul 28 2025 at 17:52):

Forgive my ignorance, but is this string-interning optimization supposed to apply to the Roc compiler's handling of Roc source code, or the runtime representation of dynamically allocated strings, or both?

I'm curious because I've heard about the latter being done before, but I'm not sure if I've actually seen it done in practice. I know Java has a java.lang.String.intern() method, but it definitely has a performance cost that grows with the number of interned strings. Just wondering if that feature is on the roadmap.

A more concrete reason I'm asking is because it might be interesting to tinker with dynamic string interning in my compiler project I'm writing in Roc.

Richard Feldman (Jul 28 2025 at 18:15):

just a compiler internal thing

Richard Feldman (Jul 28 2025 at 18:15):

wouldn't affect how running Roc programs behave

Austin Davis (Jul 28 2025 at 18:42):

Awesome, this is turning out to be such an impressive compiler IMO!

Brendan Hansknecht (Jul 28 2025 at 19:55):

Brendan Hansknecht said:

I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.

But overall seems reasonable.

When I say this, just to clarify... I more mean this in terms of spending time here right now vs farther down the compiler. Like we still have so much of the compiler that just doesn't exist or that is just uber slow and we theoretically want the interpreter fully functional with platforms for AOC.

So more a prioritization comment than a complexity one. Sure, this could be way faster, but that will be pretty small compared to getting the compiler working as a whole or compared to the slower stages of the compiler like type checking.

Richard Feldman (Jul 28 2025 at 21:23):

totally fair, although I like having Claude work on multiple implementations in parallel, and this one is well-scoped :smile:

Richard Feldman (Jul 28 2025 at 21:36):

Joshua Warner said:

We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).

I just realized region won't always work; if we have escapes in the literal, we have to resolve those

Joshua Warner (Jul 28 2025 at 21:37):

True

Joshua Warner (Jul 28 2025 at 21:37):

But that can be done just as well when we're about to use the string, as when we're tokenizing.

Richard Feldman (Jul 28 2025 at 21:37):

probably eventually for optimized builds we should dedupe anyway

Richard Feldman (Jul 28 2025 at 21:38):

eh we do need to report invalid escapes at compile time

Richard Feldman (Jul 28 2025 at 21:38):

so we need to do it even during roc check

Joshua Warner (Jul 28 2025 at 21:38):

Checking for invalid escapes should definitely be done during the tokenization

Joshua Warner (Jul 28 2025 at 21:39):

If we want to copy the string somewhere during tokenization, we might as well resolve escapes at the same time

Richard Feldman (Jul 28 2025 at 21:41):

agreed!

Joshua Warner (Jul 28 2025 at 21:42):

Actually let me walk that back a bit...

Joshua Warner (Jul 28 2025 at 21:43):

Imagine we're in the interpreter, and we have a absolute boatload of string constants we won't happen to use.

Joshua Warner (Jul 28 2025 at 21:44):

In that case, we are actually served well (perf-wise) by not validating until use

Joshua Warner (Jul 28 2025 at 21:44):

And arguably those error messages won't necessarily help the user either, since (again) that code isn't used.

Joshua Warner (Jul 28 2025 at 21:45):

Fully agree that the case of roc check should still give all error messages there tho, even if unused.

Last updated: Jul 23 2026 at 13:15 UTC