Stream: beginners

Topic: Constant data structures for efficient lookup


view this post on Zulip Paul Stanley (Nov 27 2024 at 22:17):

I have another question relevant to the fiddling I'm doing with unicode.

The process of changing case in unicode is basically about looking up tables. U+61 ('a') maps to an uppercase value of U+41 ('A') ... and so forth. There are a few special cases -- but that's the gist. All very easy. But it's not algorithmically determinable--you simply have to look things up, which essentially means taking the UCD data and putting it into a form the module can handle. You should care a lot, however, about efficient access to the map at run time, because anything you do with a string is going to require at least one table lookup for every codepoint in the string.

The conventional approach to this would be to build some sort of efficiently accessible data structure which is simply compiled and then accessed. It is by definition completely immutable and fixed at compile time, and probably fixed for the next century. So you write something to read the UCD data and puts it into a usable form.

The unicode package currently takes a different approach: it transforms the data into functions, which then get compiled. So instead of having a constant table and a lookup function, it has a function which returns the right result by piling up boolean checks. All perfectly reasonable ... up to a point.

But taking this approach to casing starts to smell a bit (more than it already did). We end up with a when codepoint is statement with about 1500 branches before it gets to the default branch. And since I don't know how sly the compiler is about this, I worry about efficiency. If checking the entire lot is going to involve examining each branch, that's obviously going to be massively slow. On average (since most lookups will return zilch) 1500 checks per codepoint.

So ...

  1. Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.

  2. Or am I worrying unnecessarily? If (say) the compiler was smart enough to know that this particular when ... is function could be compiled to a hash table, that would solve my problem ... and I'm simply assuming it's not without knowing anything about how these things are implemented internally. If my assumption that such a statement does not require n checks where n is the number of branches is wrong, I'd be OK with the current strategy.

I can think of other possible options which would retain the current "functional" approach but be more efficient (e.g. do a 2 stage check where stage 1 checks whether a codepoint is in one of the many blocks where there are no case mappings, and then delegate to smaller branch functions where the block is potentially "interesting"). But they are a bit ad hoc, and since the essential question ("How to construct large immutable lookup tables at compile time?") is really pretty fundamental to a lot of unicode-like things, it seems sensible to think about the essential issue fairly early on, rather than to continue to go down an approach which works fine (in terms of getting a correct result) but might hit a major brick-wall in terms of performance soon.

(FWIW, I looked superficially at the Zig library that someone wrote for this. They originally started with the approach that the Roc unicode package is currently taking, simply turning the data into functions and switch statements. They then abandoned that completely and rewrote the whole library to use more efficient structures (tries, mostly, I think) in the interest of speed ... and it produced a very significant (order of magnitude) speed up. I've also looked at the most-used OCaml library, and that uses statically defined constant arrays to construct the tables, which works fine because OCaml has "real" arrays. I have no idea how Lists are implemented internally in Roc, or whether they could be deployed for that purpose.)

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 22:35):

This is a mapping from codepoint to codepoint, so I32 to I32?

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 22:35):

And it is dense? As in, would make sense as an array (wouldn't have millions of empty entries or be gigantic)?

view this post on Zulip Luke Boswell (Nov 27 2024 at 22:42):

I remember asking a similar question at the time I was writing the first version. :grinning:

I can't remember exactly, but I thought we concluded it was fine and the compiler can handle it. I'll see if I can dig up the thread.

But interested to follow this thread again.

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:03):

Paul Stanley said:

Can I build a table (e.g. a Dict) at compile time and simply access it later? How? Is it as simple as just defining that dict in a module? I think of Dicts as structures that get created at run-time (which would be problematic), but maybe I'm wrong. It's obviously quite pointless and inefficient to build a function that builds a Dict every time it gets called, and I wouldn't really even want something that had to be constructed even once while the program is running: this should be a fixed in stone structure that is simply accessed when the program is in motion. I really want a static and immutable data structure which is fixed at compile time, and where the only thing that ever happens when the parts are moving is a lookup.

so the current plan is to have this Just Work exactly the way you want it to, but it's not implemented yet. :big_smile:

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:03):

the design is in #ideas > Compile time computation

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:05):

the basic idea is that if you define a top-level constant like foo = ... then whatever that ... is gets fully evaluated at compile time automatically (and if it hits a crash or something, that also gets reported at compile time before the program even runs) and ultimately gets emitted as a static chunk of memory in the final binary.

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:05):

since that hasn't been implemented yet, the way it works today is that if you say foo = ... then foo silently gets compiled behind the scenes to a function, which gets called every time you reference foo

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:06):

personally, I actually like the idea of implementing it the way you ultimately want it to work, even though the performance will be bad now (because it regenerates the dictionary every time)

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:06):

because it would give us an excellent test case for the feature once we've actually built it

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:11):

that said, are you talking about putting capitalization directly in roc-lang/unicode?

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:13):

we've talked about this in the past, and capitalization can't be done correctly using only Unicode information - e.g. "what is the capitalization of the letter i?" cannot be correctly answered without knowing whether you mean in English or in Turkish, which has different rules - the exact same Unicode code point for i capitalizes to I in English and İ in Turkish, and the exact same code point for I lowercases to i in English and ı in Turkish

view this post on Zulip Paul Stanley (Nov 27 2024 at 23:15):

It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints. It's not really dense no ... compared to the number of potential codepoints. It's dense in some parts (blocks where there are case differences), but there are large tracts of the space with no mappings at all.

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:15):

I forget what we decided the last time I talked to @Luke Boswell about this - it was either to have a separate roc-lang/locale package or maybe to do something more granular, like having a module for different languages inside unicode (e.g. En.roc, Tr.roc, etc.)

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:16):

Paul Stanley said:

It's a mapping from codepoint to codepoint (mostly) ... though occasionally from codepoint to a list of codepoints.

gotcha, so in those cases I guess we pick from within the list using the language code?

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:17):

anyway, I guess the main point is that since it's impossible to correctly implement a capitalize : Str -> Str function, we need to figure out what to offer instead of that :big_smile:

view this post on Zulip Luke Boswell (Nov 27 2024 at 23:17):

We haven't really dug into it, I think we decided to defer it for later design.

view this post on Zulip Luke Boswell (Nov 27 2024 at 23:18):

I implemented the Locale thing in basic-cli so we've at least got a starting point. I decided to use just a Str instead of a tag union because it may be possible for new ones to be added in future, and I think any application should be able to handle that

view this post on Zulip Paul Stanley (Nov 27 2024 at 23:18):

There is a defined algorithm (which includes the mappings for Turkish, Azeri and Lithuanian. It's part of unicode and should be there, for sure.

view this post on Zulip Luke Boswell (Nov 27 2024 at 23:19):

https://github.com/roc-lang/basic-cli/blob/main/platform/Locale.roc

view this post on Zulip Paul Stanley (Nov 27 2024 at 23:22):

I haven't really done more than think lightly about how to represent Locales. There are some complexities there too (no surprises) ... and in particular a decision to be taken about how much is done with types. For the basic unicode case algorithms though the only thing that matters is the primary language. I'll definitely look at basic-cli.

view this post on Zulip Richard Feldman (Nov 27 2024 at 23:24):

Paul Stanley said:

For the basic unicode case algorithms though the only thing that matters is the primary language.

ah, as in if you know the desired language and also the code point to capitalize, with those two inputs the table gives you the correct output in one lookup?

view this post on Zulip Luke Boswell (Nov 27 2024 at 23:24):

I only did some light research. Dillon Kearn's work for Elm might be helpful https://package.elm-lang.org/packages/dillonkearns/elm-bcp47-language-tag/latest/

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 23:28):

Performance-wise, I would guess that the gigantic switch would perform poorly. A dense switch will convert to a jump table. A sparse switch is at best some sort of nested log n conditional, and is at worst a linear chain of branching conditionals.

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 23:29):

For some cases where the branching is too complex to turn into a map, the switch still makes the most sense. And it often is the simplest.

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 23:30):

For something like this, a dict from codepoint to list of codepoint (one per locale I assume) is probably the simplest reasonably performant solution

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 23:31):

If possible, would be better to go to a tuple of codepoints and avoid the extra indirection of a list. Or even go to a string to get small string optimization.

view this post on Zulip Brendan Hansknecht (Nov 27 2024 at 23:32):

Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?

view this post on Zulip Paul Stanley (Nov 28 2024 at 08:31):

Brendan Hansknecht said:

Also, can you give more info on the ocaml solution? If everything is sparse, how does it map to arrays?

It's using a trie, constructed as an array of arrays of arrays. Unicode codepoints range from U+0000 to U+10FFFF. But many blocks of codepoints are for scripts which have no casing information. You can exploit this. Effectively:

So basically we're taking advantage of the fact that we're searching a sparse structure where most routes from byte to byte are dead ends, and we can simply record that. We have three levels of arrays, and if all of them were needed we'd have a massive structure: but in most cases there are no "children" and we can simply record that fact, not needing to construct the arrays that would be required if there were data in them.

view this post on Zulip Paul Stanley (Nov 28 2024 at 08:34):

On my original question, I think the right thing to do is probably construct (it's not hard) a variety of possible solutions and ... measure!

view this post on Zulip Brendan Hansknecht (Nov 28 2024 at 08:36):

I see. Makes sense. Would likely be a lot faster than any sort of hashing. Memory is slow, but 3 lookups is probably faster than hashing and looking up.

view this post on Zulip Brendan Hansknecht (Nov 28 2024 at 08:36):

In roc today, I would probably represent that as a triple nested when ... is where each branch goes over the full 256 possible values

view this post on Zulip Brendan Hansknecht (Nov 28 2024 at 08:37):

Triple nested lists would also be ok, but roc doesn't generate nested list constants efficiently today....so, not optimal.

view this post on Zulip Paul Stanley (Nov 28 2024 at 08:37):

Brendan Hansknecht said:

In roc today, I would probably represent that as a triple nested when ... is where each branch goes over the full 256 possible values

That's a brilliant idea!

view this post on Zulip Brendan Hansknecht (Nov 28 2024 at 08:40):

I think llvm should generate you a fast jump table for that. If it doesn't, I'm sure we can poke at the code gen to make it happy.

view this post on Zulip Paul Stanley (Nov 29 2024 at 08:47):

Short update.

I implemented both a single giant pattern-matching statement and a three-stage one.

I attempted to time this by placing a Utc.now! on either side of some code which called a function that executes lookups over a short-ish (c 15000 codepoint) mixed piece of unicode, converted to a list of codepoints before I began timing, and assigns them to variables. And did this several times. The sample contained a good range of characters which would have lookup results and wouldn't. I reported the delta with Utc.deltaAsMillis.

The result was ... totally inconclusive. A tiny average difference between the "trie" version and the single version. I didn't collect results systematically, but we were looking at c. 80 ms for either version.

HOWEVER, something is fishy:

The weird thing: all of these were taking the same time near as makes no difference. So either:

(I also tried timing a "dummy" identity function, which simply returned its input ... and got more or less the same timing. So ... my test is certainly dominated by things other than any actual logic.)

Since the trie version is no harder to do now that I've written the code required to produce it, should be better, and doesn't look worse, I'll probably stick with it. But I suppose the usual rules about optimization apply!

view this post on Zulip Brendan Hansknecht (Nov 29 2024 at 18:59):

Do you have code I can benchmark?

view this post on Zulip Paul Stanley (Dec 01 2024 at 17:46):

I will have to do some reconstruction ... because I've moved things on quite a while since I had the test code set up. Not urgent though!

view this post on Zulip Brendan Hansknecht (Dec 01 2024 at 18:34):

No worries. In general, I know the sharp edges of roc performance and generally can identify them pretty quick with benchmarking.


Last updated: Jul 06 2025 at 12:14 UTC