What is the motivation behind the insertion order preservation in dicts?
x = Dict.empty {}
|> Dict.insert "a" 0
|> Dict.insert "b" 1
y = Dict.empty {}
|> Dict.insert "b" 1
|> Dict.insert "a" 0
My gut feeling says the resulting dicts are structurally the same, but the insertion order says the opposite. The Hash
implementation should always respect the insertion order otherwise the order is pure overhead and a potential source of ambiguity. So the eq check should always give false
here.
Maybe my confusion comes from the name Dict
. OrderedDict
would emphasize the presence of the order which is important for the structure. On the other hand, if there is an Ordered
collection - where is the Unordered
one?
Maybe my intuition is wrong here. Anyway, the questions:
Upd. there's also a Dict.remove
that performs a swap remove, which breaks the idea of insertion order, no?
good question!
the main motivation is:
we based the design on IndexMap
from Rust, which has these properties. remove
changes the order for performance reasons; we plan to add a removeShift
which is (a lot) slower, but preserves insertion order.
when they tried replacing the standard (unordered) hashmap in the Rust compiler with IndexMap
, they found the performance was basically the same (slightly better in some places, slightly worse in others, overall about a wash) - generally, compared to a normal hash map, IndexMap
takes up more memory and takes a little bit longer to insert and to remove, but is faster at traversals. So it depends on how you're using it, but overall the amount of extra time seems to be small enough to be unconcerning.
That makes sense, I also saw the link to the IndexMap
in the std docs and looked through the issue.
But from the equality perspective - is my understanding correct that the order matters?
So the rule of thumb can be "if you need to walk your collection - use Dict
, if you don't plan to walk much and need equality check - use something else"?
Also, if my understanding is correct, then there's an inconsistency in Set
:
x = Set.empty {}
|> Set.insert 0
|> Set.insert 1
y = Set.empty {}
|> Set.insert 1
|> Set.insert 0
x == y
but running Set.toList
on them produces order preserved results:
[0, 1] : List (Num *)
[1, 0] : List (Num *)
==
should not care about insertion order for either Set
or Dict
(if it does, that's a bug!)
Then Set.toList
is not correct because if both sets are equal, any operation on them should produce the same result. But Set.toList
outputs lists ordered differently. The same intuition for the Dict
.
Otherwise, I probably don't get the equality concept in roc
yeah we discussed this scenario - since Roc supports custom equality (e.g. anyone can write a custom Eq
implementation for their opaque type which does whatever they like), it's not possible to guarantee in the general case that x == y
implies f x == f y
(unless we get rid of custom equality, but it seems more valuable to have custom equality in the language than this guarantee)
so since we already can't have that guarantee, the design questions for Eq
and toList
on dictionaries and sets become things like:
performance aside, unfortunately this is a case where making it less surprising makes it more error-prone
specifically, it seems common to use ==
on dictionaries and sets assuming that they'll be equal if they have all the same keys and values
especially in the case of sets - that's basically the definition of "set equality" outside of programming
so although it's surprising that toList a == toList b
might be false when a == b
it doesn't seem as likely to cause bugs as having ==
on the collections themselves sometimes return false even when the collections have the same contents (but they were inserted in a different order)
so unfortunately this is a situation where there does not exist a design that checks every box, and this design is the one we ended up with because it seems to be less error-prone than the less-surprising alternative
Outside the programming, sets have no order. Currently, the Set
looks like a structure
{
sortedUniqueItems: [a, b, c],
orderIndicies: [0, 2, 1]
}
When we iterate over it, we get the order a, c, b
combining both of the "fields" and when we hash it - we use only the sortedUniqueItems
part of the structure. So hashing algorithm here is an intentional collision. To avoid that, it should either use the whole structure or the structure shouldn't have the orderIndicies
But here is where the performance
point bites. insert
or toList
can potentially become more complex (to keep the sorted list)
I would suggest to drop the iteration out from both Dict
and Set
or provide an arbitrary order because order has no sense there conceptually but it would be too bold to suggest such change :D
My main concern is that it's not possible to have both insertion order
and equality by hash
for these structures (whatever is possible of corse, it's only a matter of convention. More on this below)
it's not possible to guarantee in the general case that
x == y
impliesf x == f y
The most interesting part in this snippet is the ==
. What I'm talking about is about a contract which the Eq
ability is. The ability should provide a clear meaning of itself - with this, we can say if the implementation is correct for the structure. The snippet can be rewritten this way: "A.Eq x y
doesn't imply the B.Eq (f x) (f y)
" and it will be correct because now we see that we have different equalities. But since Eq
are not functions of the A
and B
types but are implementation of the ability Eq
- the general case for the snippet should be guaranteed otherwise it's a violation of the contract and can be considered as a bug.
Shortly, yes, it's not possible to guarantee the general case, but it's possible for the standard library.
I am confused by your first example, what is {sortedUniqueItems, orderIndices}
supposed to be?
An abstract underlying structure, minimal implementation that corresponds the current behavior
Sure, but it is not representative of the perf characteristics. The current behavior is faster for some uses and slower for other uses. It is based on indexmap. They tested replacing every hashmap in rustc
with an indexmap and the perf was approximately the same. rustc
has a lot of perf benchmarking, so I think this is an exceptionally minimal tradeoff.
For example, insert
is basically the same cost and toList
is actually free.
Yeah, Richard told the same. I’m not talking about the performance already
As for arbitrary order there are some issues. First, you have to decide if the dict is randomly or statically seeded. Then based on that you hit different issues.
Statically seeded:
Randomly seeded:
x == y
implies f x == f y
is broken for the entire language.With insertion order, we can be randomly seeded without issues since the hashed order is never exposed to userland.
both
insertion order
andequality by hash
This is possible. We do it for Set
.
hmm, actually we don't currently do it for Set
, but we have hashUnordered which means we could do it.
more valuable to have custom equality in the language than this guarantee
I don't really get this one ... is there some way to define == such that the user never 'feels a missing need' for a custom == implementation _and_ they get a wider guarantee for more values x,y that x == y => f x == f y ?
Type-specific definitions for a notion of 'sameness comparison' that lose x eq y => f x eq f y trends more toward 'type specific' operations to me -- is there really a reason to force the variety of kinds of 'application specific sameness operations for a type' to share the == operator name ...?
What is the unifying design principle for == in roc such that folks implementing a custom == would be likely to minimize surprise when implementing a custom == for their types?
A simple example here that can break x == y => f x == f y
is any structure with a capacity or group of unused elements. If any information is exposed about the capacity or unused elements in any way, you can depend on that in f
to make the comparison no longer equal.
Kiryl Dziamura said:
it's not possible to guarantee the general case, but it's possible for the standard library.
sure, but as guarantees go, I don't think that's a particularly valuable one. :big_smile:
to me, the main value in guarantees is that they let you quickly rule things out and/or stop thinking about them altogether, without getting burned
"can I rely on a == b
implying f a == f b
? hang on, let me real quick just make sure that f
, a
, and b
only use builtin functions and data structures..."
I think as a Roc programmer, it's more helpful to have a mental model of "I can never rely on a == b
implying f a == f b
" so I don't really see much value in having that be true for builtins (and one could even argue it's a footgun because you might get accustomed to that being true, and then get burned when you discover it's not true after using a userspace opaque type with custom equality)
as a quick note, which might be helpful: here are some of the things that are important to me in this design decision:
so for example, "let's change it so that you are no longer permitted to do custom equality in Roc" is a big downside, and so I think there would need to be a correspondingly large upside on one of these dimensions to justify it
e.g. "people are seeing bugs all the time in Python because they also maintain dictionary insertion order but don't consider it during equality checks" would be a compelling point because it speaks to "how error-prone is it" (I haven't heard this from Python programmers though)
but like "this design just feels wrong to me" is not specific enough a reason to change it, given all the tradeoffs here...we've discussed a lot of alternative designs in the past, and they all had more serious downsides than "it feels wrong" (and often included that one too, in one way or another!)
also worth noting: "how surprising is the behavior" has an unfortunately high minimum amount of surprise here because...
toList a == toList b
is false when a == b
, that is surprising to some peoplea == b
were to be false just because the order was different, that would also be surprising to some peopleso either design would be surprising to one substantial group or another. Also, if we were to get rid of traversals altogether, that would be an extreme loss of functionality.
we could require toList
(and walk
, and any other traversals where ordering becomes observable) to take a sorting function or something like that, but that would have an extremely high performance cost for the most common case where all you want to do is to traverse the collection and don't care about order
No offense from my side. Your reasoning means that Eq
doesn't require the extensionality property. It’s strange to me but I can live with that 😊
fair enough! :smiley:
General question, do any languages have extensional equality across all types? I would assume that would be pretty impossible with reasonable ergonomics. You wouldn't be able to have any types with custom equality (at least to my understanding).
I doubt, will ask my friend Gepete :grinning_face_with_smiling_eyes:
I think it's in general an expected property, but can’t be mandatory. Btw, it would be nice to add the Set.toList
as an example of such behavior to the Eq
doc
Interesting! Thanks for the thoughts :)
Out of curiosity -- I wonder the degree to which the examples mentioned actually fully cover the design space for custom ==
eg
These seem like compelling examples ... are they actually the only ones that matter?
Sorry if Im asking folks to rediscuss something that has been debated sufficiently and well settled -- I just wandering into this discussion by wondering if there might be a design where less power could actually be "better" and more likely to maximize the benefit ... - like maybe custom == is not a general function users should be able to define for types -- but more like an optional description per type of some fields that are not observable for equality comparison and some fields that have a particular behavior with regard to sequences and ordering ...
Another example, though in a similar line is having a file and collision resistant large hash stored together. Then only comparing the hash instead of the entire file.
I have seen that in systems.
Also, in Roc, opaque types do not have equality by default. So any struct with an opaque type in it would lose equality without having custom equality on the opaque type.
But yeah, I think it mostly comes down to custom handling of fields (ignored, out of order, only going to length instead of capacity)
Brendan Hansknecht said:
A simple example here that can break
x == y => f x == f y
is any structure with a capacity or group of unused elements. If any information is exposed about the capacity or unused elements in any way, you can depend on that inf
to make the comparison no longer equal.
In the case of capacity the structures with the same payload but different capacities are equal as long as the capacity doesn't affect any properties of the structure. If capacity means allocation, which can be extended/panic when the capacity is exceeded, then capacity is a kind of metadata, not data itself, and shouldn't affect the eq check. On the other hand, if the capacity is a part of business logic and acts as a limiter that ignores new data when the cap is exceeded - it’s part of the payload.
I mean, capacity, length, empty items - they are parts of metadata if we are talking about a preallocated array. The same way the order is part of metadata for a set. Since the set has no order by its nature - it’s only a nuance of implementation.
The exposed capacity of an array or the exposed order of a set - they are exposure of implementation details. From that perspective, there are no problems with the insertion order.
I apologize for being picky and annoying. Just wanted to explain my intuition. I think we are on the same page. In the end of the day, what is considered as meta is the matter of the domain
Not picky and annoying. A good clarification of how you view it.
I do agree that the order of a set can just be an implementation detail (we have decided to expose it in a specific way).
I think where this gets complex is that people want to convert sets to lists or walk the elements of a set. Those are common use cases.
In order to avoid a large cost, exposing the set order explicitly is the cleanest option (at least from our view so far based on looking at tradeoffs. Totally open to consider other solutions)
If Roc wasn't a pure functional language, I definitely would push for a random order. Sadly, that leaks randomness into Roc userland: Insert 0 to 10 into a set; Convert it to a list; Get the first element; you now have a random number.
For security purposes, the problem with the weak hash system you outlined (predictable hashes) is a problem of the developer's expertise and choice of tool, I think. I believe there are crypto structures based on a stable but configurable entropy source. But it would be cumbersome to do in std, of course, thus it makes sense to have a predictable order in terms of reliability. I see only three ways to do so: insertion order, sorted order, and rnd based as a special case for the sorted. Since rnd variant requires an explicit input for the structure, and sorting requires the implementation of Ord
for the elements, the only simple choice is the insertion order that anyway allows sorting afterward.
So all points in the thread make sense to me.
Just to clarify, the problem isn't with a weak hash system. The problem is having a known hash system and seed (anyone can look at the roc source code and find out what it is)
Even if we were to switch to a cryptographically secure hash, this attack is totally doable due to the limited capacity of a dict/set. On top of that a cryptographically secure hash is so slow that it isn't reasonable to begin with.
We did chat a good bit about requiring a userland specified seed. Which could alleviate the larger problem. The issue is that not all platforms provide a way to get randomness and if a user hardcodes a seed cause it is convenient, they are back to the same issue.
Aside: maybe not yet in the latest nightly, but Set hashing and Dict hashing + equality should now fully work.
So sets as dict keys and checking if dict1 == dict2 should now work. Though dict only has Eq if the value it holds has Eq.
I understand the problem with public algo, just used too general wording.
A great point regarding the platform constraints on crypto security btw!
Last updated: Jul 06 2025 at 12:14 UTC