Stream: beginners

Topic: Dict insertion order


view this post on Zulip Kiryl Dziamura (May 24 2023 at 12:32):

What is the motivation behind the insertion order preservation in dicts?

x = Dict.empty {}
    |> Dict.insert "a" 0
    |> Dict.insert "b" 1

y = Dict.empty {}
    |> Dict.insert "b" 1
    |> Dict.insert "a" 0

My gut feeling says the resulting dicts are structurally the same, but the insertion order says the opposite. The Hash implementation should always respect the insertion order otherwise the order is pure overhead and a potential source of ambiguity. So the eq check should always give false here.

Maybe my confusion comes from the name Dict. OrderedDict would emphasize the presence of the order which is important for the structure. On the other hand, if there is an Ordered collection - where is the Unordered one?

Maybe my intuition is wrong here. Anyway, the questions:

Upd. there's also a Dict.remove that performs a swap remove, which breaks the idea of insertion order, no?

view this post on Zulip Richard Feldman (May 24 2023 at 12:53):

good question!

view this post on Zulip Richard Feldman (May 24 2023 at 12:56):

the main motivation is:

view this post on Zulip Richard Feldman (May 24 2023 at 12:57):

we based the design on IndexMap from Rust, which has these properties. remove changes the order for performance reasons; we plan to add a removeShift which is (a lot) slower, but preserves insertion order.

view this post on Zulip Richard Feldman (May 24 2023 at 12:59):

when they tried replacing the standard (unordered) hashmap in the Rust compiler with IndexMap, they found the performance was basically the same (slightly better in some places, slightly worse in others, overall about a wash) - generally, compared to a normal hash map, IndexMap takes up more memory and takes a little bit longer to insert and to remove, but is faster at traversals. So it depends on how you're using it, but overall the amount of extra time seems to be small enough to be unconcerning.

view this post on Zulip Kiryl Dziamura (May 24 2023 at 13:25):

That makes sense, I also saw the link to the IndexMap in the std docs and looked through the issue.

But from the equality perspective - is my understanding correct that the order matters?

So the rule of thumb can be "if you need to walk your collection - use Dict, if you don't plan to walk much and need equality check - use something else"?

Also, if my understanding is correct, then there's an inconsistency in Set:

x = Set.empty {}
    |> Set.insert 0
    |> Set.insert 1

y = Set.empty {}
    |> Set.insert 1
    |> Set.insert 0

x == y

but running Set.toList on them produces order preserved results:

[0, 1] : List (Num *)
[1, 0] : List (Num *)

view this post on Zulip Richard Feldman (May 24 2023 at 13:34):

== should not care about insertion order for either Set or Dict

view this post on Zulip Richard Feldman (May 24 2023 at 13:35):

(if it does, that's a bug!)

view this post on Zulip Kiryl Dziamura (May 24 2023 at 14:00):

Then Set.toList is not correct because if both sets are equal, any operation on them should produce the same result. But Set.toList outputs lists ordered differently. The same intuition for the Dict.
Otherwise, I probably don't get the equality concept in roc

view this post on Zulip Richard Feldman (May 24 2023 at 14:25):

yeah we discussed this scenario - since Roc supports custom equality (e.g. anyone can write a custom Eq implementation for their opaque type which does whatever they like), it's not possible to guarantee in the general case that x == y implies f x == f y (unless we get rid of custom equality, but it seems more valuable to have custom equality in the language than this guarantee)

so since we already can't have that guarantee, the design questions for Eq and toList on dictionaries and sets become things like:

view this post on Zulip Richard Feldman (May 24 2023 at 14:25):

performance aside, unfortunately this is a case where making it less surprising makes it more error-prone

view this post on Zulip Richard Feldman (May 24 2023 at 14:26):

specifically, it seems common to use == on dictionaries and sets assuming that they'll be equal if they have all the same keys and values

view this post on Zulip Richard Feldman (May 24 2023 at 14:27):

especially in the case of sets - that's basically the definition of "set equality" outside of programming

view this post on Zulip Richard Feldman (May 24 2023 at 14:28):

so although it's surprising that toList a == toList b might be false when a == b it doesn't seem as likely to cause bugs as having == on the collections themselves sometimes return false even when the collections have the same contents (but they were inserted in a different order)

view this post on Zulip Richard Feldman (May 24 2023 at 14:29):

so unfortunately this is a situation where there does not exist a design that checks every box, and this design is the one we ended up with because it seems to be less error-prone than the less-surprising alternative

view this post on Zulip Kiryl Dziamura (May 24 2023 at 15:33):

Outside the programming, sets have no order. Currently, the Set looks like a structure

{
  sortedUniqueItems: [a, b, c],
  orderIndicies: [0, 2, 1]
}

When we iterate over it, we get the order a, c, b combining both of the "fields" and when we hash it - we use only the sortedUniqueItems part of the structure. So hashing algorithm here is an intentional collision. To avoid that, it should either use the whole structure or the structure shouldn't have the orderIndicies

But here is where the performance point bites. insert or toList can potentially become more complex (to keep the sorted list)


I would suggest to drop the iteration out from both Dict and Set or provide an arbitrary order because order has no sense there conceptually but it would be too bold to suggest such change :D
My main concern is that it's not possible to have both insertion order and equality by hash for these structures (whatever is possible of corse, it's only a matter of convention. More on this below)


it's not possible to guarantee in the general case that x == y implies f x == f y

The most interesting part in this snippet is the ==. What I'm talking about is about a contract which the Eq ability is. The ability should provide a clear meaning of itself - with this, we can say if the implementation is correct for the structure. The snippet can be rewritten this way: "A.Eq x y doesn't imply the B.Eq (f x) (f y)" and it will be correct because now we see that we have different equalities. But since Eq are not functions of the A and B types but are implementation of the ability Eq - the general case for the snippet should be guaranteed otherwise it's a violation of the contract and can be considered as a bug.

Shortly, yes, it's not possible to guarantee the general case, but it's possible for the standard library.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:41):

I am confused by your first example, what is {sortedUniqueItems, orderIndices} supposed to be?

view this post on Zulip Kiryl Dziamura (May 24 2023 at 15:44):

An abstract underlying structure, minimal implementation that corresponds the current behavior

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:48):

Sure, but it is not representative of the perf characteristics. The current behavior is faster for some uses and slower for other uses. It is based on indexmap. They tested replacing every hashmap in rustc with an indexmap and the perf was approximately the same. rustc has a lot of perf benchmarking, so I think this is an exceptionally minimal tradeoff.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:49):

For example, insert is basically the same cost and toList is actually free.

view this post on Zulip Kiryl Dziamura (May 24 2023 at 15:50):

Yeah, Richard told the same. I’m not talking about the performance already

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:57):

As for arbitrary order there are some issues. First, you have to decide if the dict is randomly or statically seeded. Then based on that you hit different issues.

Statically seeded:

Randomly seeded:

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:58):

With insertion order, we can be randomly seeded without issues since the hashed order is never exposed to userland.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 15:59):

both insertion order and equality by hash

This is possible. We do it for Set.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 16:02):

hmm, actually we don't currently do it for Set, but we have hashUnordered which means we could do it.

view this post on Zulip Ben Cohen (May 24 2023 at 16:06):

more valuable to have custom equality in the language than this guarantee

I don't really get this one ... is there some way to define == such that the user never 'feels a missing need' for a custom == implementation _and_ they get a wider guarantee for more values x,y that x == y => f x == f y ?

Type-specific definitions for a notion of 'sameness comparison' that lose x eq y => f x eq f y trends more toward 'type specific' operations to me -- is there really a reason to force the variety of kinds of 'application specific sameness operations for a type' to share the == operator name ...?

What is the unifying design principle for == in roc such that folks implementing a custom == would be likely to minimize surprise when implementing a custom == for their types?

view this post on Zulip Brendan Hansknecht (May 24 2023 at 16:11):

A simple example here that can break x == y => f x == f y is any structure with a capacity or group of unused elements. If any information is exposed about the capacity or unused elements in any way, you can depend on that in f to make the comparison no longer equal.

view this post on Zulip Richard Feldman (May 24 2023 at 16:14):

Kiryl Dziamura said:

it's not possible to guarantee the general case, but it's possible for the standard library.

sure, but as guarantees go, I don't think that's a particularly valuable one. :big_smile:

to me, the main value in guarantees is that they let you quickly rule things out and/or stop thinking about them altogether, without getting burned

view this post on Zulip Richard Feldman (May 24 2023 at 16:15):

"can I rely on a == b implying f a == f b? hang on, let me real quick just make sure that f, a, and b only use builtin functions and data structures..."

view this post on Zulip Richard Feldman (May 24 2023 at 16:16):

I think as a Roc programmer, it's more helpful to have a mental model of "I can never rely on a == b implying f a == f b" so I don't really see much value in having that be true for builtins (and one could even argue it's a footgun because you might get accustomed to that being true, and then get burned when you discover it's not true after using a userspace opaque type with custom equality)

view this post on Zulip Richard Feldman (May 24 2023 at 16:18):

as a quick note, which might be helpful: here are some of the things that are important to me in this design decision:

view this post on Zulip Richard Feldman (May 24 2023 at 16:19):

so for example, "let's change it so that you are no longer permitted to do custom equality in Roc" is a big downside, and so I think there would need to be a correspondingly large upside on one of these dimensions to justify it

view this post on Zulip Richard Feldman (May 24 2023 at 16:20):

e.g. "people are seeing bugs all the time in Python because they also maintain dictionary insertion order but don't consider it during equality checks" would be a compelling point because it speaks to "how error-prone is it" (I haven't heard this from Python programmers though)

view this post on Zulip Richard Feldman (May 24 2023 at 16:21):

but like "this design just feels wrong to me" is not specific enough a reason to change it, given all the tradeoffs here...we've discussed a lot of alternative designs in the past, and they all had more serious downsides than "it feels wrong" (and often included that one too, in one way or another!)

view this post on Zulip Richard Feldman (May 24 2023 at 16:23):

also worth noting: "how surprising is the behavior" has an unfortunately high minimum amount of surprise here because...

so either design would be surprising to one substantial group or another. Also, if we were to get rid of traversals altogether, that would be an extreme loss of functionality.

view this post on Zulip Richard Feldman (May 24 2023 at 16:33):

we could require toList (and walk, and any other traversals where ordering becomes observable) to take a sorting function or something like that, but that would have an extremely high performance cost for the most common case where all you want to do is to traverse the collection and don't care about order

view this post on Zulip Kiryl Dziamura (May 24 2023 at 16:34):

No offense from my side. Your reasoning means that Eq doesn't require the extensionality property. It’s strange to me but I can live with that 😊

view this post on Zulip Richard Feldman (May 24 2023 at 16:35):

fair enough! :smiley:

view this post on Zulip Brendan Hansknecht (May 24 2023 at 16:56):

General question, do any languages have extensional equality across all types? I would assume that would be pretty impossible with reasonable ergonomics. You wouldn't be able to have any types with custom equality (at least to my understanding).

view this post on Zulip Kiryl Dziamura (May 24 2023 at 17:10):

I doubt, will ask my friend Gepete :grinning_face_with_smiling_eyes:
I think it's in general an expected property, but can’t be mandatory. Btw, it would be nice to add the Set.toList as an example of such behavior to the Eq doc

view this post on Zulip Ben Cohen (May 24 2023 at 17:30):

Interesting! Thanks for the thoughts :)

Out of curiosity -- I wonder the degree to which the examples mentioned actually fully cover the design space for custom ==

eg

These seem like compelling examples ... are they actually the only ones that matter?

Sorry if Im asking folks to rediscuss something that has been debated sufficiently and well settled -- I just wandering into this discussion by wondering if there might be a design where less power could actually be "better" and more likely to maximize the benefit ... - like maybe custom == is not a general function users should be able to define for types -- but more like an optional description per type of some fields that are not observable for equality comparison and some fields that have a particular behavior with regard to sequences and ordering ...

view this post on Zulip Brendan Hansknecht (May 24 2023 at 17:43):

Another example, though in a similar line is having a file and collision resistant large hash stored together. Then only comparing the hash instead of the entire file.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 17:43):

I have seen that in systems.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 17:46):

Also, in Roc, opaque types do not have equality by default. So any struct with an opaque type in it would lose equality without having custom equality on the opaque type.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 17:46):

But yeah, I think it mostly comes down to custom handling of fields (ignored, out of order, only going to length instead of capacity)

view this post on Zulip Kiryl Dziamura (May 24 2023 at 22:10):

Brendan Hansknecht said:

A simple example here that can break x == y => f x == f y is any structure with a capacity or group of unused elements. If any information is exposed about the capacity or unused elements in any way, you can depend on that in f to make the comparison no longer equal.

In the case of capacity the structures with the same payload but different capacities are equal as long as the capacity doesn't affect any properties of the structure. If capacity means allocation, which can be extended/panic when the capacity is exceeded, then capacity is a kind of metadata, not data itself, and shouldn't affect the eq check. On the other hand, if the capacity is a part of business logic and acts as a limiter that ignores new data when the cap is exceeded - it’s part of the payload.

I mean, capacity, length, empty items - they are parts of metadata if we are talking about a preallocated array. The same way the order is part of metadata for a set. Since the set has no order by its nature - it’s only a nuance of implementation.

The exposed capacity of an array or the exposed order of a set - they are exposure of implementation details. From that perspective, there are no problems with the insertion order.

I apologize for being picky and annoying. Just wanted to explain my intuition. I think we are on the same page. In the end of the day, what is considered as meta is the matter of the domain

view this post on Zulip Brendan Hansknecht (May 24 2023 at 22:38):

Not picky and annoying. A good clarification of how you view it.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 22:39):

I do agree that the order of a set can just be an implementation detail (we have decided to expose it in a specific way).

view this post on Zulip Brendan Hansknecht (May 24 2023 at 22:39):

I think where this gets complex is that people want to convert sets to lists or walk the elements of a set. Those are common use cases.

view this post on Zulip Brendan Hansknecht (May 24 2023 at 22:40):

In order to avoid a large cost, exposing the set order explicitly is the cleanest option (at least from our view so far based on looking at tradeoffs. Totally open to consider other solutions)

view this post on Zulip Brendan Hansknecht (May 24 2023 at 22:49):

If Roc wasn't a pure functional language, I definitely would push for a random order. Sadly, that leaks randomness into Roc userland: Insert 0 to 10 into a set; Convert it to a list; Get the first element; you now have a random number.

view this post on Zulip Kiryl Dziamura (May 25 2023 at 07:13):

For security purposes, the problem with the weak hash system you outlined (predictable hashes) is a problem of the developer's expertise and choice of tool, I think. I believe there are crypto structures based on a stable but configurable entropy source. But it would be cumbersome to do in std, of course, thus it makes sense to have a predictable order in terms of reliability. I see only three ways to do so: insertion order, sorted order, and rnd based as a special case for the sorted. Since rnd variant requires an explicit input for the structure, and sorting requires the implementation of Ord for the elements, the only simple choice is the insertion order that anyway allows sorting afterward.
So all points in the thread make sense to me.

view this post on Zulip Brendan Hansknecht (May 25 2023 at 14:14):

Just to clarify, the problem isn't with a weak hash system. The problem is having a known hash system and seed (anyone can look at the roc source code and find out what it is)

Even if we were to switch to a cryptographically secure hash, this attack is totally doable due to the limited capacity of a dict/set. On top of that a cryptographically secure hash is so slow that it isn't reasonable to begin with.

We did chat a good bit about requiring a userland specified seed. Which could alleviate the larger problem. The issue is that not all platforms provide a way to get randomness and if a user hardcodes a seed cause it is convenient, they are back to the same issue.

view this post on Zulip Brendan Hansknecht (May 25 2023 at 14:18):

Aside: maybe not yet in the latest nightly, but Set hashing and Dict hashing + equality should now fully work.

So sets as dict keys and checking if dict1 == dict2 should now work. Though dict only has Eq if the value it holds has Eq.

view this post on Zulip Kiryl Dziamura (May 25 2023 at 14:27):

I understand the problem with public algo, just used too general wording.
A great point regarding the platform constraints on crypto security btw!


Last updated: Jul 06 2025 at 12:14 UTC