I just had an idea about hash flooding attacks - what if we encouraged decoding HTTP headers in servers by using the Decode ability to directly parse only the headers you care about?
in other words, rather than doing what most languages do (parsing the header bytes into a dictionary, and then letting you access that dictionary and do further decoding kinda like you do with environment variables, where you parse individual header strings into numbers/booleans/etc as needed), the server provides some function along the lines of:
headers : Request -> val | val has Decode
and it parses the raw header bytes directly into whatever data structure you want - e.g. let's say I have a record of type { contentLength : U64, contentType : Str }, the decoder could translate camelCase record field names into Capitalized-Kebab-Case, and just parse out exactly the headers I'm looking for in a single pass through the bytes
this would mean:
alternatively, if you need to access all the headers (because for whatever reason you want to iterate over them) you can always ask for them as a plain List, which is both the most efficient way to do that and also not vulnerable to hash flooding either
How does the decode ability work with ignoring data? Also, how does decode work with scanning once while ignoring field ordering? Just haven't thought about those depths of it
I really like the idea extemporalgenome shared on GH: parse the common ones into a struct (constant time lookups, yay!) and hold the rest in some other data structure.
That said, I don't see how the "other data structure" with this signature can be something other than either an array of bytes or a linked list, both of which are gonna exhibit the perf characteristics that we wanted to avoid. (Unless the initial scan pass just found byte offsets? How would this actually work?)
yeah and the hashing required just to do the insertions alone can be enough for the attack to succeed
Due to deciding on index map as the base of our dict, the hashing alg doesn't matter, correct? As such, to stop has flooding, we just need to seed the alg with random data when creating a dict and then it becomes super hard to flood due to being unique. Flooding requires knowing the target. Also, setting practically limits on where to stop loading headers should help.
Just like it is normal to set limits to number of bytes to load from a request
that's a possibility, but seeding the algorithm with random data slows down the performance of all hashmaps - e.g. SipHash is a lot slower than Wyhash pretty much across the board
a lot of languages pay that cost, but I'd rather not make all hashmaps slow just for the sake of this one use case (HTTP headers in web servers) if we can avoid it, especially when it seems that hash flooding can be mitigated in server-specific code without affecting all other users of the stdlib!
You can still use wyhash
It would add the cost of a few extra bitwise operations
The cost is storing an extra integer
Since hashing tends to be the fast part and the lookup is slow, this should have minimal costs
interesting!
I'm still overall skeptical that it's even necessary, but good to know it's an option!
interesting writeup of hash flooding attack defense in the most popular Haskell JSON parsing library: https://frasertweedale.github.io/blog-fp/posts/2021-10-12-aeson-hash-flooding-protection.html
the linked writeup (which is super detailed and includes examples of how to test a server for hash flooding susceptibility!) includes one of the mitigations we talked about early on here: don't alter the hash function itself, but rather have the collision resolution use something with a nonlinear traversal time, e.g. another hash map with a different hashing algorithm
As party of our investigation, we have come up with a new-ish approach to dealing with collisions. Instead of using linear chaining on collisions, we could instead recursively have the hashmap contain another hashmap that uses a different salt. This approach has been fully implemented and was ready to merge. The solution would strictly improve performance, at the cost of a minimal amount of extra memory in the case of a lot of collisions, without breaking backwards compatibility. However, it was rejected (privately) by two maintainers at the time because it does not come with a proof that it does not have any other vulnerabilities.
that reason for rejection sounds disappointing to me; the original talk introducing hash flooding explicitly cites this category of mitigation (improving the asymptotics of collision resolution) as an effective defense, and its seems like their testing harness is a pretty effective way of testing out whether the attack is still reproducible!
JSON dictionaries are perhaps a stronger case than HTTP headers for having the stdlib Dict be resistant to hash flooding
because certainly decoding JSON into a Dict will be something people will want to do, and often that JSON will be coming from an untrusted input (e.g. HTTP request)
and in that scenario, it seems like the only way someone could mitigate this (assuming Dict itself didn't) would be to cap the input size to a small enough number of bytes that executing the attack would be infeasible
but of course that limitation might not be compatible with their use case; it might be important that they allow untrusted inputs large enough to perform a successful hash flood attack
so putting all that together, maybe the best design is:
IndexMap with a fast hashwith a careful enough design of the fallback mechanism, it seems plausible that it would never affect performance in any noticeable way if an attack is not happening, while still being able to effectively mitigate an attack
relatedly, I'm not sure that the fallback map would need to use randomness; if it used a different hashing algorithm from the main map, and especially if it used one that was good at minimizing collisions (at the expense of hashing speed, say), it seems like it could potentially be infeasible for an attacker to find a set of keys that would actually collide on both hashing algorithms in practice
but if it needed to, as Brendan noted earlier, it would always be possible to do that; after the 4 "normal" collisions (or whatever number), it could generate a random seed on the fly and store that on the heap for future collision resolution. I'd like to avoid that if possible though, because we'd need to introduce another primitive for all hosts to implement (e.g. roc_random or something, similar to roc_alloc) unless we maybe used something like heap pointers as seeds - which of course would vary by Dict, but wouldn't be terribly random
I am very confused, how does the second hash map "strictly improve performance". Haven't read the write up yet, but that just does not sound possible.
EDIT: after reading the article, i think they mean strictly improve performance in the case of a hash collision attack, I do no think they mean improve performance in the general case.
Also, if we are already using indexmap because of ordering, salting would not affect the roc app at all. It would not be visible in roc user space at all. This means that the more complex solution of multiple hash maps shouldn't be needed in our case.
Also, depending on the goal, it could be a per compile random value instead of a new random value (that would avoid host additions at the cost of some security).
Long term, I think that host functions of this sort are of little cost because they should be possible to make into a library in the host language and shared. Even without a library, it is likely about 5 lines of code.
if it used a different hashing algorithm from the main map, and especially if it used one that was good at minimizing collisions (at the expense of hashing speed, say)
I don't think you can have a slow hashing algorithm for the second hashing algorithm. I am pretty sure every miss in the first map needs to check the second map. As such, if the second hashing function is slow, lookup time for items not in the map would be very slow.
added some more thoughts to the issue
For the fun of it, I just did a hash flooding attack on the Roc Set:
hash_flooding_str.png
hash_flooding_u128.png
hmm...my graphs uploaded weird...one sec...
Ok. fixed.
U128 has the expected exponential time cost when has flooding. Every time we insert double the number of elements, it takes 4 times longer to perform the operation.
For Str, since we have to allocate and copying into a List in order to get at the bytes, hashing takes a ton of time. As such, the cost of hashing overshadows the cost of the probing for inserts. So until we have seemless slices, I guess hash flooding doesn't make Str hashing much worse in Roc. Of course with enough elements, eventually hash flooding would effect strings as well.
Last updated: Jun 16 2026 at 16:19 UTC