Stream: ideas

Topic: small interned strings


view this post on Zulip Richard Feldman (Jul 27 2025 at 22:26):

@Joshua Warner continuing that topic we'd discussed briefly in the meetup, of storing small strings inline in the 4B interned IDs, here's a proposal for how that format could look:

view this post on Zulip Richard Feldman (Jul 27 2025 at 22:28):

so that means we can store _blah!_ inline in 4 bytes, we can still store a little over 50% of the total number of interned strings compared to before, and all we're paying are a few extra bit twiddling ops

view this post on Zulip Richard Feldman (Jul 27 2025 at 22:28):

plus probably some branch mispredictions here and there, but probably a lot of things can be made branchless

view this post on Zulip Brendan Hansknecht (Jul 28 2025 at 14:39):

I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.

But overall seems reasonable.

view this post on Zulip Anton (Jul 28 2025 at 14:42):

I wonder if this is worth the complexity currently.

:plus:

view this post on Zulip Richard Feldman (Jul 28 2025 at 14:48):

Josh did a histogram of symbols on a test code base and it seems like this optimization would apply to more than half of them, so a pretty serious optimization :sweat_smile:

view this post on Zulip Richard Feldman (Jul 28 2025 at 14:48):

also the original motivation was that string interning was taking up most of the time of the tokenization phase

view this post on Zulip Joshua Warner (Jul 28 2025 at 15:25):

Not _most_ of the time - about 25%, but it is the single biggest chunk of time currently.

view this post on Zulip Joshua Warner (Jul 28 2025 at 15:27):

You could pack even more bits into there with some trickery :P

view this post on Zulip Richard Feldman (Jul 28 2025 at 15:40):

what would we use the extra bits for?

view this post on Zulip Joshua Warner (Jul 28 2025 at 15:49):

Storing longer identifiers of course!

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:07):

:thinking: example?

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:08):

Well, if we're sticking with ascii identifiers, there are < 64 valid chars in that bunch, and so we could pack 5-byte identifiers into 32 bits

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:09):

Or even use a static huffman table :stuck_out_tongue_closed_eyes:

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:35):

With a static huffman table, ~74% of identifiers could be compressed into 28 bits or less (leaving 3 bits for var/unused/etc, and 1 bit to mark when the identifier is in the table).

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:35):

oh I see, so it would be like:

const Idx = packed struct(u32) {
    is_inline: bool,
    attributes: enum(u2) { unused, variable, fx, unused_fx },
    first_char: u5,
    other_chars: [4]u6,
}

u6 fits 64 chars, which is exactly enough for (26 lowercase + 26 uppercase + 10 digits + underscore + bang)

interestingly, the first char can fit in u5 (up to 32 possibilities) because the first character of an identifier can't be a digit (or else it would be a number literal), we'd store ! or _ in the attributes, and we can know if it's uppercase vs lowercase from context so we can just always store it as lowercase - so we only need 26 of the 32 possibilities.

this is interesting because actually it reveals an optimization I'd never thought of: store interned identifier strings as always being lowercase, because that way we can save storing e.g. both Serializer as well as serializer as separate interned strings; instead, we store both as the same string, they get the same ID, but because we always know from context whether we're expecting lowercase vs uppercase strings, we can always convert back to the appropriate one for display

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:38):

also I realized that for attributes only those 4 combinations actually need to be stored; e.g. an unused var would be a warning that it's var for no reason since you never reassign it (and if you reassign it, it's no longer unused), and making a reassignable effectful function would probably come up ~never in practice, so no need to give it a speecial case

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:39):

what I like about that vs a static huffman table is that we don't need to pay for extra memory for lookups; this all gets decoded using bit shifts etc in the memory we're already loading

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:40):

that's really cool :smiley:

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:42):

I don't think decoding will be common. That's mostly for debugging tooling and error messaging.

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:43):

oh because we usually show it using Region instead? :thinking:

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:44):

Yes, and even then that's only really for emitting error messages

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:44):

And I guess things like symbol names in the binary :thinking:

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:45):

(I don't think we should heavily optimize the error messaging code path - we should cap that at 100 per build or something, and it'll just never be an issue)

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:48):

All this said - I'm not sure how far it's worth going down this road (yet), since it limits some potential future optimizations.

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:50):

e.g. Carbon uses dense identifier indices that can be reasonably packed into 23 bits and stored _inline_ in the token representation.

view this post on Zulip Richard Feldman (Jul 28 2025 at 16:50):

how does that work?

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:51):

They use 8 bits token kind, 1 bit "whitespace before", 23 bits "extra" (used for identifier id, literal id, etc depending on token type), and 32 bits for the byte offset in the file.

view this post on Zulip Joshua Warner (Jul 28 2025 at 16:53):

23 bits (~8 million) is reasonable when you're talking about a limit on unique identifiers

view this post on Zulip Richard Feldman (Jul 28 2025 at 17:02):

do we just store string literals as regions? or do we intern those too?

view this post on Zulip Joshua Warner (Jul 28 2025 at 17:04):

We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).

view this post on Zulip Richard Feldman (Jul 28 2025 at 17:06):

so we just use the region?

view this post on Zulip Joshua Warner (Jul 28 2025 at 17:06):

Yes

view this post on Zulip Richard Feldman (Jul 28 2025 at 17:07):

cool, that design seems good to me!

view this post on Zulip Joshua Warner (Jul 28 2025 at 17:10):

Semi-related question: would _MyUnusedFakeLowerIdent be considered lowercase? uppercase? error?

view this post on Zulip Richard Feldman (Jul 28 2025 at 17:13):

good question - I think uppercase, basically "unused tag"

view this post on Zulip Richard Feldman (Jul 28 2025 at 17:14):

in a hypothetical future where we give warnings for unused tags (e.g. I put it in my type alias or nominal type but I never actually construct it anywhere), this would be a way to be like "yeah I know but I still want to put it in here as a reminder or something"

view this post on Zulip Joshua Warner (Jul 28 2025 at 17:14):

"yeah I know but I still want to put it in here as a reminder or something"

Sounds like a great use case for a comment!

view this post on Zulip Joshua Warner (Jul 28 2025 at 17:15):

FWIW these are currently treated as NamedUnderscore which is basically always treated as if it were lowercase in the parser.

view this post on Zulip Austin Davis (Jul 28 2025 at 17:52):

Forgive my ignorance, but is this string-interning optimization supposed to apply to the Roc compiler's handling of Roc source code, or the runtime representation of dynamically allocated strings, or both?

I'm curious because I've heard about the latter being done before, but I'm not sure if I've actually seen it done in practice. I know Java has a java.lang.String.intern() method, but it definitely has a performance cost that grows with the number of interned strings. Just wondering if that feature is on the roadmap.

A more concrete reason I'm asking is because it might be interesting to tinker with dynamic string interning in my compiler project I'm writing in Roc.

view this post on Zulip Richard Feldman (Jul 28 2025 at 18:15):

just a compiler internal thing

view this post on Zulip Richard Feldman (Jul 28 2025 at 18:15):

wouldn't affect how running Roc programs behave

view this post on Zulip Austin Davis (Jul 28 2025 at 18:42):

Awesome, this is turning out to be such an impressive compiler IMO!

view this post on Zulip Brendan Hansknecht (Jul 28 2025 at 19:55):

Brendan Hansknecht said:

I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.

But overall seems reasonable.

When I say this, just to clarify... I more mean this in terms of spending time here right now vs farther down the compiler. Like we still have so much of the compiler that just doesn't exist or that is just uber slow and we theoretically want the interpreter fully functional with platforms for AOC.

So more a prioritization comment than a complexity one. Sure, this could be way faster, but that will be pretty small compared to getting the compiler working as a whole or compared to the slower stages of the compiler like type checking.

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:23):

totally fair, although I like having Claude work on multiple implementations in parallel, and this one is well-scoped :smile:

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:36):

Joshua Warner said:

We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).

I just realized region won't always work; if we have escapes in the literal, we have to resolve those

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:37):

True

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:37):

But that can be done just as well when we're about to use the string, as when we're tokenizing.

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:37):

probably eventually for optimized builds we should dedupe anyway

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:38):

eh we do need to report invalid escapes at compile time

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:38):

so we need to do it even during roc check

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:38):

Checking for invalid escapes should definitely be done during the tokenization

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:39):

If we want to copy the string somewhere during tokenization, we might as well resolve escapes at the same time

view this post on Zulip Richard Feldman (Jul 28 2025 at 21:41):

agreed!

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:42):

Actually let me walk that back a bit...

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:43):

Imagine we're in the interpreter, and we have a absolute boatload of string constants we won't happen to use.

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:44):

In that case, we are actually served well (perf-wise) by not validating until use

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:44):

And arguably those error messages won't necessarily help the user either, since (again) that code isn't used.

view this post on Zulip Joshua Warner (Jul 28 2025 at 21:45):

Fully agree that the case of roc check should still give all error messages there tho, even if unused.


Last updated: Jun 16 2026 at 16:19 UTC