hashmap alternatives · ideas · Zulip Chat Archive

so I've been surprised to learn over the past year that hashmaps are often slower in practice than flat arrays of key/value tuples where you do an O(n) linear scan for most operations instead of the hashmap's O(1).

Richard Feldman (Nov 15 2021 at 05:02):

Richard Feldman (Nov 15 2021 at 05:03):

but of course eventually the O(n) will catch up with you; if you have a billion elements in the collection, for example, a hash map will be way faster than a linear scan of an array

Richard Feldman (Nov 15 2021 at 05:03):

but this makes me wonder about a sorted version of a dictionary backed by a flat array

Richard Feldman (Nov 15 2021 at 05:04):

Richard Feldman (Nov 15 2021 at 05:05):

one downside I can see is that you have a lot more branch mispredictions when it's searching the array to find the right spot

Richard Feldman (Nov 15 2021 at 05:06):

which wouldn't happen in the case of the linear array scan; instead, the CPU would always be able to prefetch the next cache line because it would know you're doing a linear scan

Richard Feldman (Nov 15 2021 at 05:06):

so maybe the O(log n) sorted version might also run slower than the O(n) linear scan! :sweat_smile:

Richard Feldman (Nov 15 2021 at 05:11):

it makes me wonder if there's a "small string optimization" version of the data structure which could be helpful - that is, at a small enough capacity (say, under 1024 elements) the backing data structure is a flat array, and above that it's a hash map

Richard Feldman (Nov 15 2021 at 05:11):

so you get linear scan performance for smaller collections, and then for larger collections you get a branch misprediction followed by O(1) operations

Richard Feldman (Nov 15 2021 at 05:12):

I don't know if there's any data structure out there like that, or any research into its performance!

Richard Feldman (Nov 15 2021 at 05:34):

I also be wonder about SIMD and data structures like that - for example, if you had small enough values, you could potentially do a linear scan faster using SIMD

Richard Feldman (Nov 15 2021 at 05:35):

but you could take that idea a step further and, instead of using tuples, organize the backing array into 64B chunks with the keys at the beginning and then as many values as will fit in the remainder of the 64B

Richard Feldman (Nov 15 2021 at 05:36):

Richard Feldman (Nov 15 2021 at 05:37):

so for small keys (e.g. numbers) you could probably often scan an entire 64B chunk in 1-2 SIMD instructions to find out if the key is there, and if so, the corresponding value is already in cache

Richard Feldman (Nov 15 2021 at 05:39):

alternatively, you could store all the keys in one array and all the values in another; that would mean you'd be able to find the key much faster, but when you did, the value wouldn't be prefetched, so you might have a cache miss

Richard Feldman (Nov 15 2021 at 05:39):

Brendan Hansknecht (Nov 15 2021 at 06:09):

I think the very interesting question and likely hard part of a data structure for something like this is the graceful transition from one form to another. If you use a flat map based dictionary and it grows large enough that you want to swap it to a hashmap based dictionary, there will likely be a huge cost that will make performance analysis quite brittle.

E.g. yesterday it ran fast but today it is terribly slow. Oh, today we have 1025 items and the cost to convert to a hash map broke our performance.

Brendan Hansknecht (Nov 15 2021 at 06:12):

I think being able to pick an implementation based on the use case is very important, but without a ton of research and testing, I would be extremely skeptical of defaulting to something other than a hash map.

Brendan Hansknecht (Nov 15 2021 at 06:16):

Would mostly be worried about blowing up the worst case performance and users believing roc is a terrible language because of it.

So i guess hashmap is defense against the worst case as opposed to picking for average or best case.

Folkert de Vries (Nov 15 2021 at 08:16):

just a general question (also relevant to rust): at what point does storing the keys and values in separate arrays make sense?

Folkert de Vries (Nov 15 2021 at 08:17):

it removes padding potentially, but if the key is small, then simd will just tear through that array of keys

Folkert de Vries (Nov 15 2021 at 08:17):

David K (Nov 15 2021 at 09:40):

I wonder if the best approach would be to have 2 different structures with an identical API, so that one is a drop in for another. Users can choose which ever one they need (and test the performance on a case by case basis).

David K (Nov 15 2021 at 09:40):

So you could have a Map which is the default "probably good enough" version backed by a hash map and a SmallMap for increased perf on small maps.

David K (Nov 15 2021 at 09:40):

Folkert de Vries (Nov 15 2021 at 09:41):

Yes, I prefer rust's model here where there is one solid choice in the stdlib, and if you want something special, there are packages for that

Folkert de Vries (Nov 15 2021 at 09:41):

I prefer that to c++'s stdlib having many competing subtly different data structures that seem to do the same thing

David K (Nov 15 2021 at 09:45):

Ye, in c++ the annoying thing is that the one that looks like the "default" one, std::map, is the one you should probably never use.

Folkert de Vries (Nov 15 2021 at 09:48):

David K (Nov 15 2021 at 10:56):

I am slightly confused by the if example in that article. The ternary operator chain they use skips the step of loading the string from the array for the comparison.

David K (Nov 15 2021 at 10:56):

I tried running it with a function that does have to load the string and got a result that's about 2 times worse (but still faster then std::map and std::unordered_map.

David K (Nov 15 2021 at 11:04):

(turns out I messed up the function slightly and it's actually closer to 1/3 slower)

David K (Nov 15 2021 at 11:05):

if2_t2e 1.8518 # having to look up values in the array
if_t2e 1.12448
unordered_map_t2e 2.30018
map_t2e 2.99629

Folkert de Vries (Nov 15 2021 at 11:10):

David K (Nov 15 2021 at 11:17):

C++20 is a pretty nice modern-ish language. Unfortunately this article uses C++11...

Zeljko Nesic (Nov 15 2021 at 13:31):

What might make sense is calling one thing:FastDict and the other one BigDict.

I absolutely agree with the idea to give optimal data structures, but I think we should be transparent about it. Maybe you'd be able to query it at runtime eg:

implementation : Dict k a -> [ FastVariant, BigVariant ]

Joshua Warner (Nov 15 2021 at 14:06):

I like the idea of the implementation being easily swappable - and furthermore that the roc runtime/library/etc can just pick the "right" one most of the time. This could be driven by:

This could be either staticly or dynamically dispatched to the correct impl - but if it's static, I'd say that mostly rules out the automated/PGO option since we'd definitely want to avoid cases where PGO could accidentally make the wrong decision and lead to pessimization if you happen to hit some extreme case.

Also note, this is potentially a security problem if you pick the wrong one, since it can lead to much easier DOS attacks. The hashmap impl will be the safe-but-slow one.

Jim Duey (Nov 15 2021 at 14:36):

Very interesting area to put some thought into. Pardon me for asking a basic question, but does Roc use anything like a Hash Array Mapped Trie? https://en.wikipedia.org/wiki/Hash_array_mapped_trie (Clojure programmer here. :grinning: ) There's been some work in the Clojure community to make persistent hash maps fast. And I did a version in C that automatically detects when you can mutate a map rather than duplicate structure that relies on the ref count of the data structs.

kprotty (Nov 15 2021 at 14:38):

This seems to be what Erlang does so maybe it has some merit in practice: https://www.erlang.org/doc/efficiency_guide/maps.html
TLDR: they use a flat array for maps/dicts <= 32 items and switch to a HAMT which @Jim Duey mentions above

kprotty (Nov 15 2021 at 14:47):

I'm just here to give out references :), but it seems like you're close to describing the design of abseil/swiss hash tables (the default implementation of HashMap in Rust). Some references down below, but the idea is to store 7 or 8 bits of the hash in a separate 16-byte chunk and use SIMD to scan 16 slots at a time (by comparing their hashes: cmpeq_epi8). There would be false positives / collisions but the hope is that the linear SIMD scan instead of jumping around like in quadratic probing + quickly skipping over cells that definitely don't match = speed increase. Also just something to look into

Richard Feldman (Nov 15 2021 at 16:45):

Brendan Hansknecht (Nov 15 2021 at 17:32):

Especially if abilities pan out so that Dict doesn't need to be a builtin, I would be 100% for offering any of the super efficient hash maps as our base. Then we can document the performance characteristics well and add comments about other map alternatives that either we also provide implementations of, or we leave to the community to make packages for.
Of course, some of these algorithms might be a tad complex to get the needed performance in only Roc, so not sure how to deal with that.

Richard Feldman (Nov 15 2021 at 17:45):

yeah one idea to build on that is to aim for a builtin dictionary that has good asymptotics (so your program doesn't explode if you get an outlier amount of entries) but which overall prioritizes speed for a low number of entries (at the expense of performance at a high number of entries), since the more entries you have, the more likely you are to want to specialize away from a one-size-fits-all default and towards a collection more tailored to your particular use case

Richard Feldman (Nov 15 2021 at 18:38):

Richard Feldman (Nov 15 2021 at 18:43):

Brendan Hansknecht (Nov 15 2021 at 19:32):

Also, another random note, it is interesting to think what different groups are optimizing for. For example, Google variants of things tend to be optimized for the plethora of internal use cases and tend to focus more on memory usage than speed (at least if they have been optimized in the last 5ish years). The reason for this is that memory usage costs more than compute. Of course there are certain code paths that will directly affect latency of a web service and compute performance matters, but generally speaking adding more cores is cheap and web services are embarrassingly parallel.

Joseph Anthony Zullo (Nov 17 2021 at 13:16):

Well, since our lists are really arrays, I don't think tree maps would make sense here? To be consistent we should probably be using array like data structures?

Joseph Anthony Zullo (Nov 17 2021 at 13:17):

It's also worth pointing out that there are differences depending on your type of input, I remember reading an article somewhere that showed that Java bitsets are more efficient than Java hashmaps well into bitset sizes into the 1000s, but this only works for integer sets.

Joseph Anthony Zullo (Nov 17 2021 at 13:20):

I know elixir/erlang does just association arrays into hash trees, to be consistent with our list implementation we should probably do association arrays to hash tables, unless I'm misunderstanding how lists work here.

Brendan Hansknecht (Nov 17 2021 at 16:57):

Can you elaborate more on this? Why should the map data structure be consistent with the array data structure?

Joseph Anthony Zullo (Nov 18 2021 at 15:03):

On second thought I don't think I'm right, it does clearly say that Dict is like that in Elm in the roc-for-elm-programmers. I just thought it was weird to have lists be a mutable vector under the hood and then to use hash trees (someone mentions HAMTs) for Dicts which are a purely functional data structure. We have Python's list and Elm's Dict. There isn't anything necessarily wrong with this approach, and it could be potentially useful. But it can also be confusing why lists can't be used nonlinearly without cloning, but dicts are cloned for free automatically.

Stream: ideas

Topic: hashmap alternatives

Richard Feldman (Nov 15 2021 at 05:01):

Richard Feldman (Nov 15 2021 at 05:02):

Richard Feldman (Nov 15 2021 at 05:03):

Richard Feldman (Nov 15 2021 at 05:03):

Richard Feldman (Nov 15 2021 at 05:04):

Richard Feldman (Nov 15 2021 at 05:05):

Richard Feldman (Nov 15 2021 at 05:06):

Richard Feldman (Nov 15 2021 at 05:06):

Richard Feldman (Nov 15 2021 at 05:11):

Richard Feldman (Nov 15 2021 at 05:11):

Richard Feldman (Nov 15 2021 at 05:12):

Richard Feldman (Nov 15 2021 at 05:34):

Richard Feldman (Nov 15 2021 at 05:35):

Richard Feldman (Nov 15 2021 at 05:36):

Richard Feldman (Nov 15 2021 at 05:37):

Richard Feldman (Nov 15 2021 at 05:39):

Richard Feldman (Nov 15 2021 at 05:39):

Brendan Hansknecht (Nov 15 2021 at 06:09):

Brendan Hansknecht (Nov 15 2021 at 06:12):

Brendan Hansknecht (Nov 15 2021 at 06:16):

Folkert de Vries (Nov 15 2021 at 08:16):

Folkert de Vries (Nov 15 2021 at 08:17):

Folkert de Vries (Nov 15 2021 at 08:17):

David K (Nov 15 2021 at 09:40):

David K (Nov 15 2021 at 09:40):

David K (Nov 15 2021 at 09:40):

Folkert de Vries (Nov 15 2021 at 09:41):

Folkert de Vries (Nov 15 2021 at 09:41):

David K (Nov 15 2021 at 09:45):

Folkert de Vries (Nov 15 2021 at 09:48):

David K (Nov 15 2021 at 10:56):

David K (Nov 15 2021 at 10:56):

David K (Nov 15 2021 at 11:04):

David K (Nov 15 2021 at 11:05):

Folkert de Vries (Nov 15 2021 at 11:10):

Folkert de Vries (Nov 15 2021 at 11:10):

David K (Nov 15 2021 at 11:17):

Zeljko Nesic (Nov 15 2021 at 13:31):

Joshua Warner (Nov 15 2021 at 14:06):

Jim Duey (Nov 15 2021 at 14:36):

kprotty (Nov 15 2021 at 14:38):

kprotty (Nov 15 2021 at 14:47):

Richard Feldman (Nov 15 2021 at 16:45):

Brendan Hansknecht (Nov 15 2021 at 17:32):

Richard Feldman (Nov 15 2021 at 17:45):

Richard Feldman (Nov 15 2021 at 18:38):

Richard Feldman (Nov 15 2021 at 18:43):

Richard Feldman (Nov 15 2021 at 18:43):

Brendan Hansknecht (Nov 15 2021 at 19:32):

Joseph Anthony Zullo (Nov 17 2021 at 13:16):

Joseph Anthony Zullo (Nov 17 2021 at 13:17):

Joseph Anthony Zullo (Nov 17 2021 at 13:20):

Brendan Hansknecht (Nov 17 2021 at 16:57):

Joseph Anthony Zullo (Nov 18 2021 at 15:03):