Reading about the two-stage lookup tables for Unicode properties, and the implementation notes suggest the following is an optimal method.
I've been scratching my head thinking about how to do this in Roc, and I'm not quite sure I really understand how this works.
My understanding is that instead of just having a list that is Num.MaxU32
in length and then doing a getProperty = \u32 -> List.get u32
, you could break this into two tables which are smaller, and then do a lookup on the first table to get the offset into the second table.
I'm probably getting tired, but how does this work? Is anyone able to provide a simple example of this?
I'm reasonably confident I can generate the tables using a script, and I feel like I could sample the desired values by looping through all possible values, but I think I'm missing a key ingredient here.
unsigned get_property(unsigned ch)
{
const unsigned BLOCK_SIZE = 256;
unsigned block_offset = stage1[ch / BLOCK_SIZE] * BLOCK_SIZE;
return stage2[block_offset + ch % BLOCK_SIZE];
}
IIRC a key detail here is that some (most?) of the blocks will actually be identical, and thus can be duplicated. That's the primary advantage of this method over just doing the large table look up.
I saw a blog post a while ago about someone building a better compression method for these tables. I couldn't find it tho.
Last updated: Jul 06 2025 at 12:14 UTC