Constant data structures for efficient lookup · beginners

Stream: beginners

Topic: Constant data structures for efficient lookup

Paul Stanley (Nov 27 2024 at 22:17):

I have another question relevant to the fiddling I'm doing with unicode.

The process of changing case in unicode is basically about looking up tables. U+61 ('a') maps to an uppercase value of U+41 ('A') ... and so forth. There are a few special cases -- but that's the gist. All very easy. But it's not algorithmically determinable--you simply have to look things up, which essentially means taking the UCD data and putting it into a form the module can handle. You should care a lot, however, about efficient access to the map at run time, because anything you do with a string is going to require at least one table lookup for every codepoint in the string.

The conventional approach to this would be to build some sort of efficiently accessible data structure which is simply compiled and then accessed. It is by definition completely immutable and fixed at compile time, and probably fixed for the next century. So you write something to read the UCD data and puts it into a usable form.

The unicode package currently takes a different approach: it transforms the data into functions, which then get compiled. So instead of having a constant table and a lookup function, it has a function which returns the right result by piling up boolean checks. All perfectly reasonable ... up to a point.

But taking this approach to casing starts to smell a bit (more than it already did). We end up with a when codepoint is statement with about 1500 branches before it gets to the default branch. And since I don't know how sly the compiler is about this, I worry about efficiency. If checking the entire lot is going to involve examining each branch, that's obviously going to be massively slow. On average (since most lookups will return zilch) 1500 checks per codepoint.

So ...

Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.
Or am I worrying unnecessarily? If (say) the compiler was smart enough to know that this particular when ... is function could be compiled to a hash table, that would solve my problem ... and I'm simply assuming it's not without knowing anything about how these things are implemented internally. If my assumption that such a statement does not require n checks where n is the number of branches is wrong, I'd be OK with the current strategy.

I can think of other possible options which would retain the current "functional" approach but be more efficient (e.g. do a 2 stage check where stage 1 checks whether a codepoint is in one of the many blocks where there are no case mappings, and then delegate to smaller branch functions where the block is potentially "interesting"). But they are a bit ad hoc, and since the essential question ("How to construct large immutable lookup tables at compile time?") is really pretty fundamental to a lot of unicode-like things, it seems sensible to think about the essential issue fairly early on, rather than to continue to go down an approach which works fine (in terms of getting a correct result) but might hit a major brick-wall in terms of performance soon.

(FWIW, I looked superficially at the Zig library that someone wrote for this. They originally started with the approach that the Roc unicode package is currently taking, simply turning the data into functions and switch statements. They then abandoned that completely and rewrote the whole library to use more efficient structures (tries, mostly, I think) in the interest of speed ... and it produced a very significant (order of magnitude) speed up. I've also looked at the most-used OCaml library, and that uses statically defined constant arrays to construct the tables, which works fine because OCaml has "real" arrays. I have no idea how Lists are implemented internally in Roc, or whether they could be deployed for that purpose.)

Brendan Hansknecht (Nov 27 2024 at 22:35):

This is a mapping from codepoint to codepoint, so I32 to I32?

Brendan Hansknecht (Nov 27 2024 at 22:35):

And it is dense? As in, would make sense as an array (wouldn't have millions of empty entries or be gigantic)?

Luke Boswell (Nov 27 2024 at 22:42):

I remember asking a similar question at the time I was writing the first version. :grinning:

I can't remember exactly, but I thought we concluded it was fine and the compiler can handle it. I'll see if I can dig up the thread.

But interested to follow this thread again.

Richard Feldman (Nov 27 2024 at 23:03):

Paul Stanley said:

Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.

so the current plan is to have this Just Work exactly the way you want it to, but it's not implemented yet. :big_smile:

Richard Feldman (Nov 27 2024 at 23:03):

the design is in #ideas > Compile time computation

Richard Feldman (Nov 27 2024 at 23:05):

the basic idea is that if you define a top-level constant like foo = ... then whatever that ... is gets fully evaluated at compile time automatically (and if it hits a crash or something, that also gets reported at compile time before the program even runs) and ultimately gets emitted as a static chunk of memory in the final binary.

Richard Feldman (Nov 27 2024 at 23:05):

since that hasn't been implemented yet, the way it works today is that if you say foo = ... then foo silently gets compiled behind the scenes to a function, which gets called every time you reference foo

Richard Feldman (Nov 27 2024 at 23:06):

personally, I actually like the idea of implementing it the way you ultimately want it to work, even though the performance will be bad now (because it regenerates the dictionary every time)

Richard Feldman (Nov 27 2024 at 23:06):

because it would give us an excellent test case for the feature once we've actually built it

Richard Feldman (Nov 27 2024 at 23:11):

that said, are you talking about putting capitalization directly in roc-lang/unicode?

Richard Feldman (Nov 27 2024 at 23:13):

we've talked about this in the past, and capitalization can't be done correctly using only Unicode information - e.g. "what is the capitalization of the letter i?" cannot be correctly answered without knowing whether you mean in English or in Turkish, which has different rules - the exact same Unicode code point for i capitalizes to I in English and İ in Turkish, and the exact same code point for I lowercases to i in English and ı in Turkish

Paul Stanley (Nov 27 2024 at 23:15):

It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints. It's not really dense no ... compared to the number of potential codepoints. It's dense in some parts (blocks where there are case differences), but there are large tracts of the space with no mappings at all.

Richard Feldman (Nov 27 2024 at 23:15):

I forget what we decided the last time I talked to @Luke Boswell about this - it was either to have a separate roc-lang/locale package or maybe to do something more granular, like having a module for different languages inside unicode (e.g. En.roc, Tr.roc, etc.)

Richard Feldman (Nov 27 2024 at 23:16):

Paul Stanley said:

It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints.

gotcha, so in those cases I guess we pick from within the list using the language code?

Richard Feldman (Nov 27 2024 at 23:17):

anyway, I guess the main point is that since it's impossible to correctly implement a capitalize : Str -> Str function, we need to figure out what to offer instead of that :big_smile:

Luke Boswell (Nov 27 2024 at 23:17):

We haven't really dug into it, I think we decided to defer it for later design.

Luke Boswell (Nov 27 2024 at 23:18):

I implemented the Locale thing in basic-cli so we've at least got a starting point. I decided to use just a Str instead of a tag union because it may be possible for new ones to be added in future, and I think any application should be able to handle that

Paul Stanley (Nov 27 2024 at 23:18):

There is a defined algorithm (which includes the mappings for Turkish, Azeri and Lithuanian. It's part of unicode and should be there, for sure.

Luke Boswell (Nov 27 2024 at 23:19):

https://github.com/roc-lang/basic-cli/blob/main/platform/Locale.roc

Paul Stanley (Nov 27 2024 at 23:22):

I haven't really done more than think lightly about how to represent Locales. There are some complexities there too (no surprises) ... and in particular a decision to be taken about how much is done with types. For the basic unicode case algorithms though the only thing that matters is the primary language. I'll definitely look at basic-cli.

Richard Feldman (Nov 27 2024 at 23:24):

Paul Stanley said:

For the basic unicode case algorithms though the only thing that matters is the primary language.

ah, as in if you know the desired language and also the code point to capitalize, with those two inputs the table gives you the correct output in one lookup?

Luke Boswell (Nov 27 2024 at 23:24):

I only did some light research. Dillon Kearn's work for Elm might be helpful https://package.elm-lang.org/packages/dillonkearns/elm-bcp47-language-tag/latest/

Brendan Hansknecht (Nov 27 2024 at 23:28):

Performance-wise, I would guess that the gigantic switch would perform poorly. A dense switch will convert to a jump table. A sparse switch is at best some sort of nested log n conditional, and is at worst a linear chain of branching conditionals.

Brendan Hansknecht (Nov 27 2024 at 23:29):

For some cases where the branching is too complex to turn into a map, the switch still makes the most sense. And it often is the simplest.

Brendan Hansknecht (Nov 27 2024 at 23:30):

For something like this, a dict from codepoint to list of codepoint (one per locale I assume) is probably the simplest reasonably performant solution

Brendan Hansknecht (Nov 27 2024 at 23:31):

If possible, would be better to go to a tuple of codepoints and avoid the extra indirection of a list. Or even go to a string to get small string optimization.

Brendan Hansknecht (Nov 27 2024 at 23:32):

Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?

Paul Stanley (Nov 28 2024 at 08:31):

Brendan Hansknecht said:

Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?

It's using a trie, constructed as an array of arrays of arrays. Unicode codepoints range from U+0000 to U+10FFFF. But many blocks of codepoints are for scripts which have no casing information. You can exploit this. Effectively:

First index the first byte into an array. If there's no entry for any codepoint "below" that byte, return null ... otherwise return an array for the second byte. So, for example if we were looking up 01F001 we would examine simply 01 and get null -- so we know we won't have any mapping, but if we were looking up 000061 or 002150 we would get an array at level 2.
Next index the second byte into this array. Same story. In this case 000061 would return a third level array, but 002150 would return null, because there are no codepoints in the 21xx range with any casing information.
Finally index into the last array to get the value. In this case, 0061 (lower-case a) would return 0041, which is the uppercase value.

So basically we're taking advantage of the fact that we're searching a sparse structure where most routes from byte to byte are dead ends, and we can simply record that. We have three levels of arrays, and if all of them were needed we'd have a massive structure: but in most cases there are no "children" and we can simply record that fact, not needing to construct the arrays that would be required if there were data in them.

Paul Stanley (Nov 28 2024 at 08:34):

On my original question, I think the right thing to do is probably construct (it's not hard) a variety of possible solutions and ... measure!

Brendan Hansknecht (Nov 28 2024 at 08:36):

I see. Makes sense. Would likely be a lot faster than any sort of hashing. Memory is slow, but 3 lookups is probably faster than hashing and looking up.

Brendan Hansknecht (Nov 28 2024 at 08:36):

In roc today, I would probably represent that as a triple nested when ... is where each branch goes over the full 256 possible values

Brendan Hansknecht (Nov 28 2024 at 08:37):

Triple nested lists would also be ok, but roc doesn't generate nested list constants efficiently today....so, not optimal.

Paul Stanley (Nov 28 2024 at 08:37):

Brendan Hansknecht said:

In roc today, I would probably represent that as a triple nested when ... is where each branch goes over the full 256 possible values

That's a brilliant idea!

Brendan Hansknecht (Nov 28 2024 at 08:40):

I think llvm should generate you a fast jump table for that. If it doesn't, I'm sure we can poke at the code gen to make it happy.

Paul Stanley (Nov 29 2024 at 08:47):

Short update.

I implemented both a single giant pattern-matching statement and a three-stage one.

I attempted to time this by placing a Utc.now! on either side of some code which called a function that executes lookups over a short-ish (c 15000 codepoint) mixed piece of unicode, converted to a list of codepoints before I began timing, and assigns them to variables. And did this several times. The sample contained a good range of characters which would have lookup results and wouldn't. I reported the delta with Utc.deltaAsMillis.

The result was ... totally inconclusive. A tiny average difference between the "trie" version and the single version. I didn't collect results systematically, but we were looking at c. 80 ms for either version.

HOWEVER, something is fishy:

I tried testing a function that actually does something substantial (building a new list with each codepoint: lots of List.append and List.join with no attempt to optimise).
I tried testing a function that simply examines each codepoint and determines whether it has a mapping, returning true if every codepoint has a mapping.
I tried the same thing, but with List.walkUntil terminating both when a codepoint did, and when it didn't have a mapping and reporting the point it had got to. That function examined only between 1 and 7 codepoints.

The weird thing: all of these were taking the same time near as makes no difference. So either:

I'm making some mistake timing this.
The lookup is incredibly fast in either version, and the time is being dominated simply by the time taken to call functions and allocate memory for variables and I'd need a much longer sample test to see any differences.

(I also tried timing a "dummy" identity function, which simply returned its input ... and got more or less the same timing. So ... my test is certainly dominated by things other than any actual logic.)

Since the trie version is no harder to do now that I've written the code required to produce it, should be better, and doesn't look worse, I'll probably stick with it. But I suppose the usual rules about optimization apply!

Brendan Hansknecht (Nov 29 2024 at 18:59):

Do you have code I can benchmark?

Paul Stanley (Dec 01 2024 at 17:46):

I will have to do some reconstruction ... because I've moved things on quite a while since I had the test code set up. Not urgent though!

Brendan Hansknecht (Dec 01 2024 at 18:34):

No worries. In general, I know the sharp edges of roc performance and generally can identify them pretty quick with benchmarking.

Last updated: Jul 26 2025 at 12:14 UTC