I have another question relevant to the fiddling I'm doing with unicode.
The process of changing case in Unicode is basically a matter of table lookups. U+0061 ('a') maps to an uppercase value of U+0041 ('A') ... and so forth. There are a few special cases -- but that's the gist. All very easy. But it's not algorithmically determinable -- you simply have to look things up, which essentially means taking the UCD data and putting it into a form the module can handle. You should care a lot, however, about efficient access to the map at run time, because anything you do with a string is going to require at least one table lookup for every codepoint in the string.
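The gist in a few lines of Python (the table entries here are a hand-picked toy subset for illustration, not the generated UCD data):

```python
# A toy slice of the simple uppercase table; the real thing would be
# generated from UnicodeData.txt in the UCD.
SIMPLE_UPPER = {
    0x0061: 0x0041,  # 'a' -> 'A'
    0x00E9: 0x00C9,  # 'é' -> 'É'
    0x03B1: 0x0391,  # 'α' -> 'Α'
}

def to_upper(cp):
    # Most codepoints have no mapping, so fall back to the input itself.
    return SIMPLE_UPPER.get(cp, cp)

print(hex(to_upper(0x0061)))  # 0x41
print(hex(to_upper(0x0031)))  # 0x31 -- digits have no case mapping
```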
The conventional approach to this would be to build some sort of efficiently accessible data structure which is simply compiled and then accessed. It is by definition completely immutable and fixed at compile time, and probably fixed for the next century. So you write something that reads the UCD data and puts it into a usable form.
The unicode package currently takes a different approach: it transforms the data into functions, which then get compiled. So instead of having a constant table and a lookup function, it has a function which returns the right result by piling up boolean checks. All perfectly reasonable ... up to a point.
But taking this approach to casing starts to smell a bit (more than it already did). We end up with a `when codepoint is` statement with about 1500 branches before it gets to the default branch. And since I don't know how sly the compiler is about this, I worry about efficiency. If checking the entire lot is going to involve examining each branch, that's obviously going to be massively slow: on average (since most lookups will return zilch) about 1500 checks per codepoint.
So ...
Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.
Or am I worrying unnecessarily? If (say) the compiler was smart enough to know that this particular `when ... is` function could be compiled to a hash table, that would solve my problem ... and I'm simply assuming it's not, without knowing anything about how these things are implemented internally. If my assumption is wrong, and such a statement does not in fact require n checks (where n is the number of branches), I'd be OK with the current strategy.
I can think of other possible options which would retain the current "functional" approach but be more efficient (e.g. a two-stage check, where stage 1 checks whether a codepoint is in one of the many blocks with no case mappings at all, and then delegates to smaller branch functions where the block is potentially "interesting"). But they are a bit ad hoc, and since the essential question ("How do you construct large immutable lookup tables at compile time?") is really pretty fundamental to a lot of unicode-like things, it seems sensible to think about the essential issue fairly early on, rather than continuing down a path which works fine (in terms of getting a correct result) but might soon hit a major brick wall in terms of performance.
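A sketch of that two-stage idea in Python (the page table is illustrative, not real UCD data): stage 1 rules out whole 256-codepoint pages that contain no mappings at all, and stage 2 does the per-codepoint lookup only for "interesting" pages.

```python
# Pages (cp >> 8) that contain at least one case mapping get a sub-table;
# every other page is absent, so stage 1 answers "no mapping" immediately.
CASED_PAGES = {
    0x00: {0x0061: 0x0041, 0x0062: 0x0042},  # part of Basic Latin
    0x01: {0x0101: 0x0100},                  # part of Latin Extended-A
}

def to_upper(cp):
    page = CASED_PAGES.get(cp >> 8)  # stage 1: is this page interesting?
    if page is None:
        return cp                    # whole page has no case mappings
    return page.get(cp, cp)          # stage 2: per-codepoint lookup
```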
(FWIW, I looked superficially at the Zig library that someone wrote for this. They originally started with the approach that the Roc unicode package is currently taking, simply turning the data into functions and switch statements. They then abandoned that completely and rewrote the whole library to use more efficient structures (tries, mostly, I think) in the interest of speed ... and it produced a very significant (order of magnitude) speed up. I've also looked at the most-used OCaml library, and that uses statically defined constant arrays to construct the tables, which works fine because OCaml has "real" arrays. I have no idea how Lists are implemented internally in Roc, or whether they could be deployed for that purpose.)
This is a mapping from codepoint to codepoint, so I32 to I32?
And it is dense? As in, would make sense as an array (wouldn't have millions of empty entries or be gigantic)?
I remember asking a similar question at the time I was writing the first version. :grinning:
I can't remember exactly, but I thought we concluded it was fine and the compiler can handle it. I'll see if I can dig up the thread.
But interested to follow this thread again.
Paul Stanley said:
Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.
so the current plan is to have this Just Work exactly the way you want it to, but it's not implemented yet. :big_smile:
the design is in #ideas > Compile time computation
the basic idea is that if you define a top-level constant like `foo = ...`, then whatever that `...` is gets fully evaluated at compile time automatically (and if it hits a `crash` or something, that also gets reported at compile time, before the program even runs) and ultimately gets emitted as a static chunk of memory in the final binary
since that hasn't been implemented yet, the way it works today is that if you say `foo = ...`, then `foo` silently gets compiled behind the scenes to a function, which gets called every time you reference `foo`
personally, I actually like the idea of implementing it the way you ultimately want it to work, even though the performance will be bad now (because it regenerates the dictionary every time)
because it would give us an excellent test case for the feature once we've actually built it
that said, are you talking about putting capitalization directly in roc-lang/unicode?
we've talked about this in the past, and capitalization can't be done correctly using only Unicode information - e.g. "what is the capitalization of the letter `i`?" cannot be correctly answered without knowing whether you mean in English or in Turkish, which have different rules - the exact same Unicode code point for `i` capitalizes to `I` in English and `İ` in Turkish, and the exact same code point for `I` lowercases to `i` in English and `ı` in Turkish
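The Turkish i rule is easy to show concretely. A Python sketch (the language codes and the one-entry special-case table are my own illustration, not an API from any library discussed here):

```python
# Turkish/Azeri special case, consulted before the default mapping:
# 'i' (U+0069) uppercases to 'İ' (U+0130) rather than 'I' (U+0049).
TR_UPPER = {0x0069: 0x0130}

def upper_cp(cp, lang):
    if lang in ("tr", "az") and cp in TR_UPPER:
        return TR_UPPER[cp]
    if 0x0061 <= cp <= 0x007A:   # ASCII stand-in for the full UCD table
        return cp - 0x20
    return cp

print(chr(upper_cp(ord("i"), "en")))  # I
print(chr(upper_cp(ord("i"), "tr")))  # İ
```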
It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints. It's not really dense no ... compared to the number of potential codepoints. It's dense in some parts (blocks where there are case differences), but there are large tracts of the space with no mappings at all.
I forget what we decided the last time I talked to @Luke Boswell about this - it was either to have a separate roc-lang/locale package or maybe to do something more granular, like having a module for different languages inside unicode (e.g. `En.roc`, `Tr.roc`, etc.)
Paul Stanley said:
It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints.
gotcha, so in those cases I guess we pick from within the list using the language code?
anyway, I guess the main point is that since it's impossible to correctly implement a `capitalize : Str -> Str` function, we need to figure out what to offer instead of that :big_smile:
We haven't really dug into it, I think we decided to defer it for later design.
I implemented the `Locale` thing in basic-cli so we've at least got a starting point. I decided to use just a `Str` instead of a tag union, because it may be possible for new ones to be added in future, and I think any application should be able to handle that
There is a defined algorithm (which includes the mappings for Turkish, Azeri and Lithuanian). It's part of Unicode and should be there, for sure.
https://github.com/roc-lang/basic-cli/blob/main/platform/Locale.roc
I haven't really done more than think lightly about how to represent Locales. There are some complexities there too (no surprises) ... and in particular a decision to be taken about how much is done with types. For the basic unicode case algorithms though the only thing that matters is the primary language. I'll definitely look at basic-cli.
Paul Stanley said:
For the basic unicode case algorithms though the only thing that matters is the primary language.
ah, as in if you know the desired language and also the code point to capitalize, with those two inputs the table gives you the correct output in one lookup?
I only did some light research. Dillon Kearn's work for Elm might be helpful https://package.elm-lang.org/packages/dillonkearns/elm-bcp47-language-tag/latest/
Performance-wise, I would guess that the gigantic switch would perform poorly. A dense switch will convert to a jump table. A sparse switch is at best some sort of nested O(log n) conditional (a binary search over the branch values), and at worst a linear chain of branching conditionals.
For some cases where the branching is too complex to turn into a map, the switch still makes the most sense. And it often is the simplest.
For something like this, a dict from codepoint to list of codepoint (one per locale I assume) is probably the simplest reasonably performant solution
If possible, would be better to go to a tuple of codepoints and avoid the extra indirection of a list. Or even go to a string to get small string optimization.
Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?
Brendan Hansknecht said:
Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?
It's using a trie, constructed as an array of arrays of arrays. Unicode codepoints range from U+0000 to U+10FFFF. But many blocks of codepoints are for scripts which have no casing information. You can exploit this. Effectively:
First, index the first byte into an array. If there's no entry for any codepoint "below" that byte, return null ... otherwise return an array for the second byte. So, for example, if we were looking up U+1F001 we would examine just the first byte (0x01) and get null -- so we know we won't have any mapping; but if we were looking up U+0061 or U+2150 we would get an array at level 2.
Next, index the second byte into this array. Same story. In this case U+0061 would return a third-level array, but U+2150 would return null, because there are no codepoints in the 21xx range with any casing information.
Finally, index into the last array to get the value. In this case, U+0061 (lower-case a) would return U+0041, which is the uppercase value.
So basically we're taking advantage of the fact that we're searching a sparse structure where most routes from byte to byte are dead ends, and we can simply record that. We have three levels of arrays, and if all of them were needed we'd have a massive structure; but in most cases there are no "children" and we can simply record that fact, not needing to construct the arrays that would be required if there were data in them.
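That three-level array-of-arrays lookup can be sketched in Python (toy data: only 'a' -> 'A' is populated, and `None` marks a dead end):

```python
def build_trie(mappings):
    # Level 1 is indexed by the top byte of the codepoint (planes 0..0x10);
    # a None entry means no codepoint under that prefix has a mapping.
    root = [None] * 0x11
    for cp, upper in mappings.items():
        b1, b2, b3 = cp >> 16, (cp >> 8) & 0xFF, cp & 0xFF
        if root[b1] is None:
            root[b1] = [None] * 256
        if root[b1][b2] is None:
            root[b1][b2] = [None] * 256
        root[b1][b2][b3] = upper
    return root

def lookup(trie, cp):
    level2 = trie[cp >> 16]
    if level2 is None:
        return None                  # dead end at the first byte
    level3 = level2[(cp >> 8) & 0xFF]
    if level3 is None:
        return None                  # dead end at the second byte
    return level3[cp & 0xFF]         # the mapping itself, or None

trie = build_trie({0x0061: 0x0041})   # just 'a' -> 'A' for the demo
print(lookup(trie, 0x0061))           # 65
print(lookup(trie, 0x2150))           # None (dead end at the second level)
print(lookup(trie, 0x1F001))          # None (dead end at the first level)
```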
On my original question, I think the right thing to do is probably construct (it's not hard) a variety of possible solutions and ... measure!
I see. Makes sense. Would likely be a lot faster than any sort of hashing. Memory is slow, but 3 lookups is probably faster than hashing and looking up.
In roc today, I would probably represent that as a triple nested `when ... is` where each branch goes over the full 256 possible values
Triple nested lists would also be ok, but roc doesn't generate nested list constants efficiently today....so, not optimal.
Brendan Hansknecht said:
In roc today, I would probably represent that as a triple nested `when ... is` where each branch goes over the full 256 possible values
That's a brilliant idea!
I think llvm should generate you a fast jump table for that. If it doesn't, I'm sure we can poke at the code gen to make it happy.
Short update.
I implemented both a single giant pattern-matching statement and a three-stage one.
I attempted to time this by placing a `Utc.now!` on either side of some code which called a function that executes lookups over a short-ish (c. 15,000 codepoint) mixed piece of unicode, converted to a list of codepoints before I began timing, and assigns them to variables. And did this several times. The sample contained a good range of characters which would and wouldn't have lookup results. I reported the delta with `Utc.deltaAsMillis`.
The result was ... totally inconclusive. A tiny average difference between the "trie" version and the single version. I didn't collect results systematically, but we were looking at c. 80 ms for either version.
HOWEVER, something is fishy:
- One version built a full list of results (using `List.append` and `List.join`, with no attempt to optimise).
- Another version used `List.walkUntil`, terminating both when a codepoint did and when it didn't have a mapping, and reporting the point it had got to. That function examined only between 1 and 7 codepoints.

The weird thing: all of these were taking the same time, near as makes no difference. So either:
(I also tried timing a "dummy" identity function, which simply returned its input ... and got more or less the same timing. So ... my test is certainly dominated by things other than any actual logic.)
Since the trie version is no harder to do now that I've written the code required to produce it, should be better, and doesn't look worse, I'll probably stick with it. But I suppose the usual rules about optimization apply!
Do you have code I can benchmark?
I will have to do some reconstruction ... because I've moved things on quite a while since I had the test code set up. Not urgent though!
No worries. In general, I know the sharp edges of roc performance and generally can identify them pretty quick with benchmarking.
Last updated: Jul 06 2025 at 12:14 UTC