hashing abstraction article · compiler development

Stream: compiler development

Topic: hashing abstraction article

Richard Feldman (Dec 12 2024 at 21:59):

this is interesting...something we could think about for our hashing API: https://purplesyringa.moe/blog/thoughts-on-rust-hashing/

Brendan Hansknecht (Dec 12 2024 at 22:35):

Only starting to skim this, but the author is already pointing out many of my complaints with the hashing api

Brendan Hansknecht (Dec 12 2024 at 22:36):

Though roc will not end up being accidentally quadradic (thanks index map)

Brendan Hansknecht (Dec 12 2024 at 22:39):

How this API should look like and whether it can be shoehorned into the existing interfaces remains to be seen. I have not started work on the design yet, and perhaps this article might be a bit premature

Oh, they don't really have any sort of suggestion or solution

Brendan Hansknecht (Dec 12 2024 at 22:40):

Also, this is why in roc, hasing a list of ints is super slow, but hashing an array of bytes is fast.

Brendan Hansknecht (Dec 12 2024 at 22:40):

We really need some sort of way to aggregate things. Especially all of our primitives types that could just be hashed as a giant chunk of bytes

Brendan Hansknecht (Dec 12 2024 at 22:42):

I tried at one point to do a streaming hasher (probably should try again), but the perf was pretty abysmal due to having to copy all bytes over to a separate list for accumulation. I think that probably could be done a lot smarter, but as mentioned in the artical, it is adhoc in an inconvenient way. This is especially true for fast hashers for hash tables

Brendan Hansknecht (Dec 12 2024 at 22:42):

For more robust and secure hashes, the copy matters a lot less

Brendan Hansknecht (Dec 12 2024 at 22:42):

All hash algorithms really just want to operate on a giant array of bytes. The more things that can be viewed that way, the better

Richard Feldman (Dec 12 2024 at 22:50):

yeah, one vague idea I had was to maybe allow our derived hashing implementations to cheat and view things as raw bytes even though a handwritten pure roc implementation couldn't do that

Brendan Hansknecht (Dec 12 2024 at 22:51):

yeah, though you probably shouldn't hash padding (though I guess you could, but would make hashing system dependent) and you also can't do that with more complex types that contain pointers.

Brendan Hansknecht (Dec 12 2024 at 22:52):

Also, I guess this is their core idea:

In my opinion, Hasher and Hash are a wrong abstraction. Instead of the Hash driving the Hasher ~~insane~~, it should be the other way round: Hash providing introspection facilities and Hasher navigating the hashed objects recursively. As a bonus, this could enable (opt-in) portable hashers.

Brendan Hansknecht (Dec 12 2024 at 22:56):

I feel like the idea here when mapped to roc would be to have the hasher use the hash spec to create a bespoke lambda that will hash in the correct block size. It will aggregate the input into blocks and would be smart enough to be able to aggregate many list elements into a single block instead of hashing each of them separate.

Brendan Hansknecht (Dec 12 2024 at 22:58):

so instead of thinking from smallest unit to biggest, it would be using a spec to think about the whole picture at once.

Brendan Hansknecht (Dec 12 2024 at 22:58):

Still not sure how it would work in practice, but worth messing around with the idea.

Richard Feldman (Dec 12 2024 at 23:00):

agreed!

Richard Feldman (Dec 12 2024 at 23:00):

we'll have a perfect opportunity to do that with static dispatch

Brendan Hansknecht (Dec 12 2024 at 23:01):

Also, adding small list optimization for list u8, probably would help a lot here. Then can just convert any numeric primitive straight to bytes without an allocation

Brendan Hansknecht (Dec 12 2024 at 23:05):

my biggest concern is that the logic to come up with the correct hashing would be quite complex and expensive. I guess if the logic generates a lambda, it could be run once and store in a dict object for example.

Brendan Hansknecht (Dec 18 2024 at 01:05):

I've been thinking about this more. I feel like for the general case for hashing a list of things, the best bet would be to hash it as an array of bytes, but have a bit of logic that when loading a chunk of the list will automatically zero out all padding bytes.

Brendan Hansknecht (Dec 18 2024 at 01:06):

The two main pain points are:

User defined hashes
Types that are more padding than not
Nested types.

But I'm the general case, this probably would be way faster.

Brendan Hansknecht (Dec 18 2024 at 01:07):

If user defined hashes was only allowed to decide extra bytes to consider padding (basically ignore some fields), it would also work in this form. But more complex user padding would still be problematic

Last updated: Aug 17 2025 at 12:14 UTC