small string hashing · compiler development

I'm not actually sure the best way to solve this. Obviously there are many, but the fact that anyone can implement the hashing ability actually makes this a bit complicated.

I think that I would like to add a general: Str.withUtf8BytesUnsafe: Str, (List U8 -> out) -> out. This function would allow any roc users to access the bytes of a Str without an allocation. That said, it would have one huge caveat. If the List U8 is still referenced after the inner lambda returns. The function will crash. It would crash with a longer message explaining why. I think this would lead to a function that in practice will only ever crash during development. It will be very easy to iron out any kinks.

This could also return a result, but I think in practice the result isn't useful. If you are calling this function, you are doing so specifically for small string support. If you need small string support, you need this function to never fail.

The big reason I think we need a more general function is that anyone can write a hasher. The hasher has control of its own internal state. As such, there is no safe way to pass a small string as a List U8 to the hasher addBytes function. The implementation of that function could be:

addBytes = \@MyHasher { lazyByteQueue }, bytes ->
     @MyHasher { lazyByteQueue: queue.push LazyByteQueue bytes }

This would break any sort of guarantees around a small string that lives on the stack and temporarily is being referenced as a list. So hashing has no choice but to panic if someone were to write this hasher and it received a small string input.

In practice, I think this function would rarely lead to crashes except during development. I believe this cause it is vary rare to only occasionally capture an input list in a function like this. As such, it is much more likely they either always capture or never capture. If they do occasionally capture, if any tests hit that workflow, the crash will be seen during development instead of release.

What are peoples general thoughts on this? I think it is a useful enough tool that we should add it even with the potential crash. In the case of #beginners > performance question about 19% of the execution time was spent in allocating, copying, and freeing single character lists for hashing strings. About 2% of the time was spent in the actual hashing.

Any other ideas? My next best solution is adding a specific special case for List U8 that allows it to have small list optimization. It would use the one free bit on the length.

Brendan Hansknecht (Oct 23 2024 at 18:57):

Richard Feldman (Oct 23 2024 at 19:26):

I think there's some nicer way to do it with Iter U8 or maybe even something bigger like Iter U64 but I haven't thought it through all the way

Brendan Hansknecht (Oct 23 2024 at 20:45):

Except it also has to degrade down to smaller size as we reach the edge of the list

Brendan Hansknecht (Oct 23 2024 at 20:46):

And other hashing algorithms probably use different sizes, so I would be skeptical that would work out well.

Richard Feldman (Oct 23 2024 at 20:51):

yeah, overall my thought is that this seems like a scary thing to introduce to the language, and I'd really prefer to find an API that doesn't have that problem

Richard Feldman (Oct 23 2024 at 20:53):

for example, another possibility is just that we give the hasher [SmallStr U64, BigStr (List U8)]

Brendan Hansknecht (Oct 23 2024 at 22:24):

That doesn't quite work either. Cause a big string that is 3 bytes needs to hash the same as a small string that is 3 bytes.

Richard Feldman (Oct 23 2024 at 22:33):

like if it's a big string, before we pass it to hash, we check the length, and smallify it if appropriate

Brendan Hansknecht (Oct 23 2024 at 22:48):

Honestly, thinking about this more, I really think we should just implement small list u8

Brendan Hansknecht (Oct 23 2024 at 22:48):

Richard Feldman (Oct 23 2024 at 23:04):

Brendan Hansknecht (Oct 23 2024 at 23:18):

And since a u8 is too small to contain any sort of pointers, we don't have to worry about refcounting complications on small lists. I don't think I want to make a generic small list (from what I have seen in the c++ ecosystems it often isn't valuable for most types unless your list is allowed to be larger than 3 usizes) Since we are limited to 3 usizes including the 1 u8 being used for recording the size, I think it would be best to keep things simple and only special cases of 1 byte sized things.

Brendan Hansknecht (Oct 23 2024 at 23:19):

Though I guess from c++ llvm small vector, I think that anything 32bit or smaller on a 64bit system is pretty reasonable to put in a small list as well. But making it generic makes the rules a lot more complicated for hosts (and our builtins).

Brendan Hansknecht (Oct 23 2024 at 23:20):

So I think I want to only apply it to lists of 1 byte wide elements and keep things simple while still unblocking all things related to small strings and lists

Brendan Hansknecht (Oct 23 2024 at 23:21):

Richard Feldman (Oct 23 2024 at 23:43):

I keep trying to avoid that but it feels like it's crossed some threshold where it's worth it

Richard Feldman (Oct 23 2024 at 23:44):

Sam Mohr (Oct 23 2024 at 23:44):

Sam Mohr (Oct 23 2024 at 23:45):

I don't expect we pack Bool into smaller than a byte, though. Meaning we would just apply this to any List Byte

Brendan Hansknecht (Oct 23 2024 at 23:54):

Brendan Hansknecht (Oct 24 2024 at 00:00):

Definitely adds some extra complexity costs. Will mostly be felt by host authors (or glue authors once glue is good).

Brendan Hansknecht (Oct 24 2024 at 00:01):

I think llvm will always optimize the extra checks in zig away as long as the builtins get inlined.

Brendan Hansknecht (Oct 24 2024 at 00:08):

Even if they don't get inlined, I think llvm has a pass were is can inline constants and generate multiple variations of a function. So it may inline element_width and optimize that away even if the function doesn't fully inline

Brendan Hansknecht (Oct 24 2024 at 00:10):

Oskar Hahn (Oct 24 2024 at 06:10):

It was complicated to implement the list type in go. This will make it a bit more complicated. For that reason, it would be nice, if there is a clear rule, when a list can be small. It will be complicated to make the decision with glue, since the list implementation in many host languages will probably be generic. In go, the type is type RocList[t any] C.struct_RocList. This type does not know, if t is Bool, U8 or I8.

I would propose, that a list can be a small list, if the type has a size of one byte or less. This also includes opaque types and {a: U8}.

With a rule like this, I think, that many host languages can check the size of t. For example in go, I could use unsafe.Sizeof.

Brendan Hansknecht (Oct 24 2024 at 06:19):

Yeah, I think I would like to keep the rule simple. Small list optimization only happens if the size of t with padding is 1 byte. Small list optimization is then exactly the same as small string optimization.

Stream: compiler development

Topic: small string hashing

Brendan Hansknecht (Oct 21 2024 at 06:33):

Brendan Hansknecht (Oct 23 2024 at 18:57):

Richard Feldman (Oct 23 2024 at 19:26):

Brendan Hansknecht (Oct 23 2024 at 20:45):

Brendan Hansknecht (Oct 23 2024 at 20:45):

Brendan Hansknecht (Oct 23 2024 at 20:46):

Richard Feldman (Oct 23 2024 at 20:51):

Richard Feldman (Oct 23 2024 at 20:53):

Brendan Hansknecht (Oct 23 2024 at 22:24):

Richard Feldman (Oct 23 2024 at 22:33):

Richard Feldman (Oct 23 2024 at 22:33):

Brendan Hansknecht (Oct 23 2024 at 22:48):

Brendan Hansknecht (Oct 23 2024 at 22:48):

Richard Feldman (Oct 23 2024 at 23:04):

Brendan Hansknecht (Oct 23 2024 at 23:18):

Brendan Hansknecht (Oct 23 2024 at 23:19):

Brendan Hansknecht (Oct 23 2024 at 23:20):

Brendan Hansknecht (Oct 23 2024 at 23:21):

Richard Feldman (Oct 23 2024 at 23:43):

Richard Feldman (Oct 23 2024 at 23:43):

Richard Feldman (Oct 23 2024 at 23:44):

Sam Mohr (Oct 23 2024 at 23:44):

Sam Mohr (Oct 23 2024 at 23:44):

Sam Mohr (Oct 23 2024 at 23:45):

Brendan Hansknecht (Oct 23 2024 at 23:54):

Brendan Hansknecht (Oct 24 2024 at 00:00):

Brendan Hansknecht (Oct 24 2024 at 00:01):

Brendan Hansknecht (Oct 24 2024 at 00:08):

Brendan Hansknecht (Oct 24 2024 at 00:10):

Oskar Hahn (Oct 24 2024 at 06:10):

Brendan Hansknecht (Oct 24 2024 at 06:19):