@Joshua Warner continuing that topic we'd discussed briefly in the meetup, of storing small strings inline in the 4B interned IDs, here's a proposal for how that format could look:
_, suffix _, and suffix ! - any combination of these can be set; if all 3 are set, you get e.g. _blah!_ which would be an unused var storing an effectful function)so that means we can store _blah!_ inline in 4 bytes, we can still store a little over 50% of the total number of interned strings compared to before, and all we're paying are a few extra bit twiddling ops
plus probably some branch mispredictions here and there, but probably a lot of things can be made branchless
I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.
But overall seems reasonable.
I wonder if this is worth the complexity currently.
:plus:
Josh did a histogram of symbols on a test code base and it seems like this optimization would apply to more than half of them, so a pretty serious optimization :sweat_smile:
also the original motivation was that string interning was taking up most of the time of the tokenization phase
Not _most_ of the time - about 25%, but it is the single biggest chunk of time currently.
You could pack even more bits into there with some trickery :P
what would we use the extra bits for?
Storing longer identifiers of course!
:thinking: example?
Well, if we're sticking with ascii identifiers, there are < 64 valid chars in that bunch, and so we could pack 5-byte identifiers into 32 bits
Or even use a static huffman table :stuck_out_tongue_closed_eyes:
With a static huffman table, ~74% of identifiers could be compressed into 28 bits or less (leaving 3 bits for var/unused/etc, and 1 bit to mark when the identifier is in the table).
oh I see, so it would be like:
const Idx = packed struct(u32) {
is_inline: bool,
attributes: enum(u2) { unused, variable, fx, unused_fx },
first_char: u5,
other_chars: [4]u6,
}
u6 fits 64 chars, which is exactly enough for (26 lowercase + 26 uppercase + 10 digits + underscore + bang)
interestingly, the first char can fit in u5 (up to 32 possibilities) because the first character of an identifier can't be a digit (or else it would be a number literal), we'd store ! or _ in the attributes, and we can know if it's uppercase vs lowercase from context so we can just always store it as lowercase - so we only need 26 of the 32 possibilities.
this is interesting because actually it reveals an optimization I'd never thought of: store interned identifier strings as always being lowercase, because that way we can save storing e.g. both Serializer as well as serializer as separate interned strings; instead, we store both as the same string, they get the same ID, but because we always know from context whether we're expecting lowercase vs uppercase strings, we can always convert back to the appropriate one for display
also I realized that for attributes only those 4 combinations actually need to be stored; e.g. an unused var would be a warning that it's var for no reason since you never reassign it (and if you reassign it, it's no longer unused), and making a reassignable effectful function would probably come up ~never in practice, so no need to give it a speecial case
what I like about that vs a static huffman table is that we don't need to pay for extra memory for lookups; this all gets decoded using bit shifts etc in the memory we're already loading
that's really cool :smiley:
I don't think decoding will be common. That's mostly for debugging tooling and error messaging.
oh because we usually show it using Region instead? :thinking:
Yes, and even then that's only really for emitting error messages
And I guess things like symbol names in the binary :thinking:
(I don't think we should heavily optimize the error messaging code path - we should cap that at 100 per build or something, and it'll just never be an issue)
All this said - I'm not sure how far it's worth going down this road (yet), since it limits some potential future optimizations.
e.g. Carbon uses dense identifier indices that can be reasonably packed into 23 bits and stored _inline_ in the token representation.
how does that work?
They use 8 bits token kind, 1 bit "whitespace before", 23 bits "extra" (used for identifier id, literal id, etc depending on token type), and 32 bits for the byte offset in the file.
23 bits (~8 million) is reasonable when you're talking about a limit on unique identifiers
do we just store string literals as regions? or do we intern those too?
We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).
so we just use the region?
Yes
cool, that design seems good to me!
Semi-related question: would _MyUnusedFakeLowerIdent be considered lowercase? uppercase? error?
good question - I think uppercase, basically "unused tag"
in a hypothetical future where we give warnings for unused tags (e.g. I put it in my type alias or nominal type but I never actually construct it anywhere), this would be a way to be like "yeah I know but I still want to put it in here as a reminder or something"
"yeah I know but I still want to put it in here as a reminder or something"
Sounds like a great use case for a comment!
FWIW these are currently treated as NamedUnderscore which is basically always treated as if it were lowercase in the parser.
Forgive my ignorance, but is this string-interning optimization supposed to apply to the Roc compiler's handling of Roc source code, or the runtime representation of dynamically allocated strings, or both?
I'm curious because I've heard about the latter being done before, but I'm not sure if I've actually seen it done in practice. I know Java has a java.lang.String.intern() method, but it definitely has a performance cost that grows with the number of interned strings. Just wondering if that feature is on the roadmap.
A more concrete reason I'm asking is because it might be interesting to tinker with dynamic string interning in my compiler project I'm writing in Roc.
just a compiler internal thing
wouldn't affect how running Roc programs behave
Awesome, this is turning out to be such an impressive compiler IMO!
Brendan Hansknecht said:
I wonder if this is worth the complexity currently. I also wonder how many identifiers fir in 4 1-127 bytes.
But overall seems reasonable.
When I say this, just to clarify... I more mean this in terms of spending time here right now vs farther down the compiler. Like we still have so much of the compiler that just doesn't exist or that is just uber slow and we theoretically want the interpreter fully functional with platforms for AOC.
So more a prioritization comment than a complexity one. Sure, this could be way faster, but that will be pretty small compared to getting the compiler working as a whole or compared to the slower stages of the compiler like type checking.
totally fair, although I like having Claude work on multiple implementations in parallel, and this one is well-scoped :smile:
Joshua Warner said:
We don't currently intern those. Neither does Carbon, but they do copy it to a string table so they don't have to reference the source (just no deduping).
I just realized region won't always work; if we have escapes in the literal, we have to resolve those
True
But that can be done just as well when we're about to use the string, as when we're tokenizing.
probably eventually for optimized builds we should dedupe anyway
eh we do need to report invalid escapes at compile time
so we need to do it even during roc check
Checking for invalid escapes should definitely be done during the tokenization
If we want to copy the string somewhere during tokenization, we might as well resolve escapes at the same time
agreed!
Actually let me walk that back a bit...
Imagine we're in the interpreter, and we have a absolute boatload of string constants we won't happen to use.
In that case, we are actually served well (perf-wise) by not validating until use
And arguably those error messages won't necessarily help the user either, since (again) that code isn't used.
Fully agree that the case of roc check should still give all error messages there tho, even if unused.
Last updated: Jun 16 2026 at 16:19 UTC