Stream: compiler development

Topic: small string hashing


view this post on Zulip Brendan Hansknecht (Oct 21 2024 at 06:33):

it looks like small string hashing today leads to an allocation, a copy, hashing, a free

I'm not actually sure the best way to solve this. Obviously there are many, but the fact that anyone can implement the hashing ability actually makes this a bit complicated.

I think that I would like to add a general: Str.withUtf8BytesUnsafe: Str, (List U8 -> out) -> out. This function would allow any roc users to access the bytes of a Str without an allocation. That said, it would have one huge caveat. If the List U8 is still referenced after the inner lambda returns. The function will crash. It would crash with a longer message explaining why. I think this would lead to a function that in practice will only ever crash during development. It will be very easy to iron out any kinks.

This could also return a result, but I think in practice the result isn't useful. If you are calling this function, you are doing so specifically for small string support. If you need small string support, you need this function to never fail.


The big reason I think we need a more general function is that anyone can write a hasher. The hasher has control of its own internal state. As such, there is no safe way to pass a small string as a List U8 to the hasher addBytes function. The implementation of that function could be:

addBytes = \@MyHasher { lazyByteQueue }, bytes ->
     @MyHasher { lazyByteQueue: queue.push LazyByteQueue bytes }

This would break any sort of guarantees around a small string that lives on the stack and temporarily is being referenced as a list. So hashing has no choice but to panic if someone were to write this hasher and it received a small string input.

In practice, I think this function would rarely lead to crashes except during development. I believe this cause it is vary rare to only occasionally capture an input list in a function like this. As such, it is much more likely they either always capture or never capture. If they do occasionally capture, if any tests hit that workflow, the crash will be seen during development instead of release.


What are peoples general thoughts on this? I think it is a useful enough tool that we should add it even with the potential crash. In the case of #beginners > performance question about 19% of the execution time was spent in allocating, copying, and freeing single character lists for hashing strings. About 2% of the time was spent in the actual hashing.

Any other ideas? My next best solution is adding a specific special case for List U8 that allows it to have small list optimization. It would use the one free bit on the length.

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 18:57):

@Richard Feldman do you have any thoughts on this?

view this post on Zulip Richard Feldman (Oct 23 2024 at 19:26):

I think there's some nicer way to do it with Iter U8 or maybe even something bigger like Iter U64 but I haven't thought it through all the way

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 20:45):

In the case of hashing, I think it would be an Iter U384 for my use case

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 20:45):

Except it also has to degrade down to smaller size as we reach the edge of the list

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 20:46):

And other hashing algorithms probably use different sizes, so I would be skeptical that would work out well.

view this post on Zulip Richard Feldman (Oct 23 2024 at 20:51):

yeah, overall my thought is that this seems like a scary thing to introduce to the language, and I'd really prefer to find an API that doesn't have that problem

view this post on Zulip Richard Feldman (Oct 23 2024 at 20:53):

for example, another possibility is just that we give the hasher [SmallStr U64, BigStr (List U8)]

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 22:24):

That doesn't quite work either. Cause a big string that is 3 bytes needs to hash the same as a small string that is 3 bytes.

view this post on Zulip Richard Feldman (Oct 23 2024 at 22:33):

we could always convert to a SmallStr variant in that case

view this post on Zulip Richard Feldman (Oct 23 2024 at 22:33):

like if it's a big string, before we pass it to hash, we check the length, and smallify it if appropriate

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 22:48):

Honestly, thinking about this more, I really think we should just implement small list u8

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 22:48):

Fully removes this issue in all cases and we have the bit laying around

view this post on Zulip Richard Feldman (Oct 23 2024 at 23:04):

that really would simplify a lot of string stuff haha

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 23:18):

And since a u8 is too small to contain any sort of pointers, we don't have to worry about refcounting complications on small lists. I don't think I want to make a generic small list (from what I have seen in the c++ ecosystems it often isn't valuable for most types unless your list is allowed to be larger than 3 usizes) Since we are limited to 3 usizes including the 1 u8 being used for recording the size, I think it would be best to keep things simple and only special cases of 1 byte sized things.

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 23:19):

Though I guess from c++ llvm small vector, I think that anything 32bit or smaller on a 64bit system is pretty reasonable to put in a small list as well. But making it generic makes the rules a lot more complicated for hosts (and our builtins).

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 23:20):

So I think I want to only apply it to lists of 1 byte wide elements and keep things simple while still unblocking all things related to small strings and lists

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 23:21):

This would mean that a Str litterally will be byte for byte a list u8

view this post on Zulip Richard Feldman (Oct 23 2024 at 23:43):

yeah that sounds reasonable

view this post on Zulip Richard Feldman (Oct 23 2024 at 23:43):

I keep trying to avoid that but it feels like it's crossed some threshold where it's worth it

view this post on Zulip Richard Feldman (Oct 23 2024 at 23:44):

what about List I8?

view this post on Zulip Sam Mohr (Oct 23 2024 at 23:44):

Sounds like anything <= 8 bits would be a candidate

view this post on Zulip Sam Mohr (Oct 23 2024 at 23:44):

Aka U8, I8, and Bool

view this post on Zulip Sam Mohr (Oct 23 2024 at 23:45):

I don't expect we pack Bool into smaller than a byte, though. Meaning we would just apply this to any List Byte

view this post on Zulip Brendan Hansknecht (Oct 23 2024 at 23:54):

Yeah, would work for anything that is a single byte with padding

view this post on Zulip Brendan Hansknecht (Oct 24 2024 at 00:00):

Definitely adds some extra complexity costs. Will mostly be felt by host authors (or glue authors once glue is good).

view this post on Zulip Brendan Hansknecht (Oct 24 2024 at 00:01):

I think llvm will always optimize the extra checks in zig away as long as the builtins get inlined.

view this post on Zulip Brendan Hansknecht (Oct 24 2024 at 00:08):

Even if they don't get inlined, I think llvm has a pass were is can inline constants and generate multiple variations of a function. So it may inline element_width and optimize that away even if the function doesn't fully inline

view this post on Zulip Brendan Hansknecht (Oct 24 2024 at 00:10):

Yeah, I think it is ipsccp: Interprocedural Sparse Conditional Constant Propagationhttps://llvm.org/docs/Passes.html#ipsccp-interprocedural-sparse-conditional-constant-propagation

view this post on Zulip Oskar Hahn (Oct 24 2024 at 06:10):

It was complicated to implement the list type in go. This will make it a bit more complicated. For that reason, it would be nice, if there is a clear rule, when a list can be small. It will be complicated to make the decision with glue, since the list implementation in many host languages will probably be generic. In go, the type is type RocList[t any] C.struct_RocList. This type does not know, if t is Bool, U8 or I8.

I would propose, that a list can be a small list, if the type has a size of one byte or less. This also includes opaque types and {a: U8}.

With a rule like this, I think, that many host languages can check the size of t. For example in go, I could use unsafe.Sizeof.

view this post on Zulip Brendan Hansknecht (Oct 24 2024 at 06:19):

Yeah, I think I would like to keep the rule simple. Small list optimization only happens if the size of t with padding is 1 byte. Small list optimization is then exactly the same as small string optimization.


Last updated: Jul 06 2025 at 12:14 UTC