Stream: ideas

Topic: hash flooding attacks


view this post on Zulip Richard Feldman (Sep 15 2022 at 05:21):

I just had an idea about hash flooding attacks - what if we encouraged decoding HTTP headers in servers by using the Decode ability to directly parse only the headers you care about?

view this post on Zulip Richard Feldman (Sep 15 2022 at 05:22):

in other words, rather than doing what most languages do (parsing the header bytes into a dictionary, and then letting you access that dictionary and do further decoding kinda like you do with environment variables, where you parse individual header strings into numbers/booleans/etc as needed), the server provides some function along the lines of:

headers : Request -> val | val has Decode

view this post on Zulip Richard Feldman (Sep 15 2022 at 05:24):

and it parses the raw header bytes directly into whatever data structure you want - e.g. let's say I have a record of type { contentLength : U64, contentType : Str }, the decoder could translate camelCase record field names into Capitalized-Kebab-Case, and just parse out exactly the headers I'm looking for in a single pass through the bytes

view this post on Zulip Richard Feldman (Sep 15 2022 at 05:25):

this would mean:

view this post on Zulip Richard Feldman (Sep 15 2022 at 05:27):

alternatively, if you need to access all the headers (because for whatever reason you want to iterate over them) you can always ask for them as a plain List, which is both the most efficient way to do that and also not vulnerable to hash flooding either

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 06:15):

How does the decode ability work with ignoring data? Also, how does decode work with scanning once while ignoring field ordering? Just haven't thought about those depths of it

view this post on Zulip Brian Hicks (Sep 15 2022 at 10:35):

I really like the idea extemporalgenome shared on GH: parse the common ones into a struct (constant time lookups, yay!) and hold the rest in some other data structure.

That said, I don't see how the "other data structure" with this signature can be something other than either an array of bytes or a linked list, both of which are gonna exhibit the perf characteristics that we wanted to avoid. (Unless the initial scan pass just found byte offsets? How would this actually work?)

view this post on Zulip Richard Feldman (Sep 15 2022 at 11:01):

yeah and the hashing required just to do the insertions alone can be enough for the attack to succeed

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 14:36):

Due to deciding on index map as the base of our dict, the hashing alg doesn't matter, correct? As such, to stop has flooding, we just need to seed the alg with random data when creating a dict and then it becomes super hard to flood due to being unique. Flooding requires knowing the target. Also, setting practically limits on where to stop loading headers should help.

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 14:36):

Just like it is normal to set limits to number of bytes to load from a request

view this post on Zulip Richard Feldman (Sep 15 2022 at 15:43):

that's a possibility, but seeding the algorithm with random data slows down the performance of all hashmaps - e.g. SipHash is a lot slower than Wyhash pretty much across the board

view this post on Zulip Richard Feldman (Sep 15 2022 at 15:44):

a lot of languages pay that cost, but I'd rather not make all hashmaps slow just for the sake of this one use case (HTTP headers in web servers) if we can avoid it, especially when it seems that hash flooding can be mitigated in server-specific code without affecting all other users of the stdlib!

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 17:32):

You can still use wyhash

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 17:33):

It would add the cost of a few extra bitwise operations

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 17:33):

The cost is storing an extra integer

view this post on Zulip Brendan Hansknecht (Sep 15 2022 at 17:34):

Since hashing tends to be the fast part and the lookup is slow, this should have minimal costs

view this post on Zulip Richard Feldman (Sep 15 2022 at 18:00):

interesting!

view this post on Zulip Richard Feldman (Sep 15 2022 at 18:01):

I'm still overall skeptical that it's even necessary, but good to know it's an option!

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:00):

interesting writeup of hash flooding attack defense in the most popular Haskell JSON parsing library: https://frasertweedale.github.io/blog-fp/posts/2021-10-12-aeson-hash-flooding-protection.html

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:02):

the linked writeup (which is super detailed and includes examples of how to test a server for hash flooding susceptibility!) includes one of the mitigations we talked about early on here: don't alter the hash function itself, but rather have the collision resolution use something with a nonlinear traversal time, e.g. another hash map with a different hashing algorithm

As party of our investigation, we have come up with a new-ish approach to dealing with collisions. Instead of using linear chaining on collisions, we could instead recursively have the hashmap contain another hashmap that uses a different salt. This approach has been fully implemented and was ready to merge. The solution would strictly improve performance, at the cost of a minimal amount of extra memory in the case of a lot of collisions, without breaking backwards compatibility. However, it was rejected (privately) by two maintainers at the time because it does not come with a proof that it does not have any other vulnerabilities.

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:04):

that reason for rejection sounds disappointing to me; the original talk introducing hash flooding explicitly cites this category of mitigation (improving the asymptotics of collision resolution) as an effective defense, and its seems like their testing harness is a pretty effective way of testing out whether the attack is still reproducible!

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:05):

JSON dictionaries are perhaps a stronger case than HTTP headers for having the stdlib Dict be resistant to hash flooding

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:05):

because certainly decoding JSON into a Dict will be something people will want to do, and often that JSON will be coming from an untrusted input (e.g. HTTP request)

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:07):

and in that scenario, it seems like the only way someone could mitigate this (assuming Dict itself didn't) would be to cap the input size to a small enough number of bytes that executing the attack would be infeasible

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:07):

but of course that limitation might not be compatible with their use case; it might be important that they allow untrusted inputs large enough to perform a successful hash flood attack

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:11):

so putting all that together, maybe the best design is:

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:13):

with a careful enough design of the fallback mechanism, it seems plausible that it would never affect performance in any noticeable way if an attack is not happening, while still being able to effectively mitigate an attack

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:15):

relatedly, I'm not sure that the fallback map would need to use randomness; if it used a different hashing algorithm from the main map, and especially if it used one that was good at minimizing collisions (at the expense of hashing speed, say), it seems like it could potentially be infeasible for an attacker to find a set of keys that would actually collide on both hashing algorithms in practice

view this post on Zulip Richard Feldman (Sep 17 2022 at 04:18):

but if it needed to, as Brendan noted earlier, it would always be possible to do that; after the 4 "normal" collisions (or whatever number), it could generate a random seed on the fly and store that on the heap for future collision resolution. I'd like to avoid that if possible though, because we'd need to introduce another primitive for all hosts to implement (e.g. roc_random or something, similar to roc_alloc) unless we maybe used something like heap pointers as seeds - which of course would vary by Dict, but wouldn't be terribly random

view this post on Zulip Brendan Hansknecht (Sep 17 2022 at 05:21):

I am very confused, how does the second hash map "strictly improve performance". Haven't read the write up yet, but that just does not sound possible.

EDIT: after reading the article, i think they mean strictly improve performance in the case of a hash collision attack, I do no think they mean improve performance in the general case.

view this post on Zulip Brendan Hansknecht (Sep 17 2022 at 05:52):

Also, if we are already using indexmap because of ordering, salting would not affect the roc app at all. It would not be visible in roc user space at all. This means that the more complex solution of multiple hash maps shouldn't be needed in our case.

Also, depending on the goal, it could be a per compile random value instead of a new random value (that would avoid host additions at the cost of some security).

Long term, I think that host functions of this sort are of little cost because they should be possible to make into a library in the host language and shared. Even without a library, it is likely about 5 lines of code.

view this post on Zulip Brendan Hansknecht (Sep 17 2022 at 05:58):

if it used a different hashing algorithm from the main map, and especially if it used one that was good at minimizing collisions (at the expense of hashing speed, say)

I don't think you can have a slow hashing algorithm for the second hashing algorithm. I am pretty sure every miss in the first map needs to check the second map. As such, if the second hashing function is slow, lookup time for items not in the map would be very slow.

view this post on Zulip Richard Feldman (Sep 17 2022 at 20:34):

added some more thoughts to the issue

view this post on Zulip Brendan Hansknecht (Mar 04 2023 at 05:22):

For the fun of it, I just did a hash flooding attack on the Roc Set:

hash_flooding_str.png
hash_flooding_u128.png

view this post on Zulip Brendan Hansknecht (Mar 04 2023 at 05:22):

hmm...my graphs uploaded weird...one sec...

view this post on Zulip Brendan Hansknecht (Mar 04 2023 at 05:24):

Ok. fixed.

view this post on Zulip Brendan Hansknecht (Mar 04 2023 at 05:26):

U128 has the expected exponential time cost when has flooding. Every time we insert double the number of elements, it takes 4 times longer to perform the operation.

view this post on Zulip Brendan Hansknecht (Mar 04 2023 at 05:28):

For Str, since we have to allocate and copying into a List in order to get at the bytes, hashing takes a ton of time. As such, the cost of hashing overshadows the cost of the probing for inserts. So until we have seemless slices, I guess hash flooding doesn't make Str hashing much worse in Roc. Of course with enough elements, eventually hash flooding would effect strings as well.


Last updated: Jun 16 2026 at 16:19 UTC