Stream: ideas

Topic: Hash Maps & Sets


view this post on Zulip Derek Gustafson (Mar 07 2022 at 20:52):

Honestly, I'm excited for abilities to be implemented in the compiler so that I can start testing out building types using abilities to see what kind of performance we get

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 20:52):

Shouldn't they be orthogonal to performance?

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 20:53):

It is more equivalent to enabling function overloading at the end of the day

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 20:54):

So naming and generic behavior, but today you could just write n versions of the function for different types.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 20:54):

Having used mutable vectors in Haskell, I was really impressed with quicksort in Roc, and want to see what other pathological algorithms can be implemented without sacrificing performance

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 20:55):

For sure

view this post on Zulip Derek Gustafson (Mar 07 2022 at 20:55):

There's the Dict/Set discussion: should they be written in Roc, or Rust?

view this post on Zulip Derek Gustafson (Mar 07 2022 at 20:56):

At the moment you can't specify the an appropriate type for Dict

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 20:58):

Ah, sure, and that is for the generic nature of it. Theoretically someone today could write DictIntStr, but it would be specific instead of generic. If written in roc, it would be almost identical to the final Dict a b.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:00):

Yeah, that sounds right

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:01):

Also, would Dict just be Dict a b | a supports Hashing

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 21:02):

Yes

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:02):

Depends on the algorithm. either supports Hashing or supports Eq

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:02):

Oh actually, probably Hashing and Eq for most algorithms.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:03):

I think You need Eq to imply Hashing

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:05):

To be useful, that is probably true, though I don't think it is technically required.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:06):

You're right. I was misremembering the abilities doc

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:07):

Also, since we just have a generic supports Hashing, does that mean that each type essentially would have to implement it's own hashing algorithm? Like could one type implement a different hashing algorithm than another and would that cause problems?

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:08):

I guess in reality most users would just call the hash method for builtin types plus a hash combining method, so they would all use the same underlying hash algorithm in the end.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:09):

My gut reaction is that you wouldn't compare the hashes from two different types.

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:09):

Oh, this is more for creating the hash for a struct or aggregate type.

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 21:09):

also worth noting that the output of the hash would be an opaque 64 bits, and that you could only define custom hashes for opaque types. so the hash algorithm is less of a concern since it’s opaque in two ways

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:10):

oh yeah, I'd love a separate thread to talk about hash maps and sets in Roc!

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:10):

I'd be very interested in seeing what those could look like in user space - @Derek Gustafson is that something you're interested in collaborating on?

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:11):

Moving the discussion to it's own thread

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:12):

@Richard Feldman Absolutely, I would be interested in working on maps and sets

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:13):

But, honestly. I'd probably be basing them off of elm or Haskell

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:18):

I would be up helping to make a user space version. We could either wait for abilities or implement something before abilities are in by hardcoding the input hashable type.

view this post on Zulip Derek Gustafson (Mar 07 2022 at 21:22):

You folks have made me even more excited for this. If no one's started it by then, I'll probably give it a go tomorrow night

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 21:23):

realistically we are at least 4-5 weeks away from being able to hash builtin types via abilities. i hope to get custom type abilities in sooner (at which point i think we could massively parallelize work in migrating the standard library to abilities, which would be cool if folks are in). but im giving a very aggressive timeline and it shouldn’t really impact the implementation of dict/set other than at the api surface

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:24):

So mess around in user space it is.

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 21:25):

We can just start with ints as the keys and xoring as the hash. That should be enough for writing most of the core of an actual hash map.

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:37):

yeah I think you could do all the time-consuming parts without abilities actually being implemented yet

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:37):

like make an IntDict, StrDict, etc. and write the hash implementations for each of them

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:37):

and then once abilities exist, combining them into Dict!

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:38):

I think Elm's Dict is a good starting point

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:38):

also Set

view this post on Zulip Folkert de Vries (Mar 07 2022 at 21:43):

those are just red-black trees. Our examples in /benchmarks have some code based on Lean benchmarks testing various RBTree functions

view this post on Zulip Folkert de Vries (Mar 07 2022 at 21:43):

but that's not really a hash map right?

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:46):

oh sorry I meant from an API perspective, not from an implementation perspective :big_smile:

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:46):

and yeah we'd want to have a hash constraint instead of an ordering constraint

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:57):

one interesting question to begin with is: what should the API be for making a dictionary from a known list of key/value pairs?

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:58):

like one answer is:

Dict.empty
    |> Dict.insert "foo" foo
    |> Dict.insert "bar" bar
    |> Dict.insert "baz" baz

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:58):

and another is something like:

Dict.fromList [
    { k: "foo", v: foo },
    { k: "bar", v: bar },
    { k: "baz", v: baz },
]

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:59):

although by default the performance would be worse in the latter case because it would have to allocate an intermediate list and then throw it away

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:59):

although there are interesting theoretical potential optimizations the compiler could do someday

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:59):

e.g. inlining plus "if all you ever do with a list is fold over it, then unroll the loop and just don't bother creating the list at all"

view this post on Zulip Richard Feldman (Mar 07 2022 at 21:59):

could make the intermediate list disappear

view this post on Zulip Richard Feldman (Mar 07 2022 at 22:00):

but we certainly don't have anything like that today :big_smile:

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 22:02):

yeah that seems like a prime opportunity for a deforestation optimization to shine

view this post on Zulip Richard Feldman (Mar 07 2022 at 22:21):

someday! :grinning_face_with_smiling_eyes:

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 22:26):

another potential fun compiler project for someone!

view this post on Zulip Folkert de Vries (Mar 07 2022 at 22:39):

current mono IR makes that kinda hard, my idea for an improved mono IR makes it more feasible

view this post on Zulip Folkert de Vries (Mar 07 2022 at 22:40):

it allows you to walk up the tree of statements again, currently we can only go down

view this post on Zulip Folkert de Vries (Mar 07 2022 at 22:40):

or well, it allows us to do this efficiently

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 23:02):

So I just realized something. I think that a hashmap needs to be refcounted like lists, otherwise it will do tons of copying and have terrible performance. So would that mean it needs to be a builtin?

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 23:02):

Oh wait....nvm. the underlying implementation of the hashmap will be a list, so it will have the refcounting.

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 23:03):

Also, do we have any preference on the actual underlying implementation?

view this post on Zulip Ayaz Hafiz (Mar 07 2022 at 23:08):

may be worth prototyping several and seeing which works best

view this post on Zulip Brendan Hansknecht (Mar 07 2022 at 23:10):

Will be interesting to see which work well with roc in general. Though also, there is the complexity of performance depending on the size of the type in the hashmap and other details like that.

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:18):

it may end up making sense to have multiple implementations

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:18):

that perform better depending on use case

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:19):

for most of my career I've always reached for whatever dictionary/set data structure was most commonly used

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:19):

but working on the Roc compiler I've started to see the performance drawbacks of doing that :sweat_smile:

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:20):

so I kinda like the idea of not having a default dictionary that's encouraged to be the thing everyone uses

view this post on Zulip Richard Feldman (Mar 07 2022 at 23:20):

and instead encouraging people to choose one based on their use case

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 00:11):

I agree except that I think it should be a recommend default with explicit guidelines around when not to use it.

I think that many users don't have the expertise to pick the correct map or necessarily want to dig into it.

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 00:17):

Expertise may not be the best wording. Many programmers could follow rules of thumbs to pick the right map, but that doesn't mean they really understand the choice or exact tradeoffs.

view this post on Zulip Jared Cone (Mar 08 2022 at 00:32):

Recommended default will also increase compatibility between libraries. Having a standard makes it easier to pass data around without having to do conversions.

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:31):

true, although it worries me a bit if the standard is something other than List

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:31):

like it'd be pretty easy to standardize on suboptimal performance that way

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:31):

maybe that's fine though

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:32):

in those use cases

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:58):

regardless, that's not something we have to figure out right away

view this post on Zulip Richard Feldman (Mar 08 2022 at 01:58):

gotta start somewhere! :smiley:

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 02:46):

What do you mean by "the standard is something other than List"?

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

let's suppose there's a standard Dict type that everyone uses

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

and let's also suppose I'm making a HTTP library

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

that parses and provides you with all the HTTP headers

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

what format would it use for that?

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

one answer is List Header

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:16):

(where Header includes the name of the header)

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:17):

another is Dict Str Header (where Header perhaps does not include the name, and that's just the key in the dict)

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:17):

a third answer is Headers which has a Headers.get : Str -> Result Header [ HeaderNotFound ]*

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:17):

it's reasonably likely that of the three designs, the Dict one would have the worst performance

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:18):

the Headers one lets you abstract away the internal representation, which might be the best

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:19):

especially in case the author of the HTTP library wants to mitigate hash flooding attacks

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:19):

which a standard Dict implementation might or might not want to address!

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:19):

that's a very niche worry that really only affects servers

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:20):

there's no real need for (say) a Dict being used in a CLI to include hash flooding mitigations, because DoSing a CLI isn't really a thing :big_smile:

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:20):

so this would be an example where I think maybe having a default Dict everyone is accustomed to reaching for might lead to a worse outcome

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 03:55):

So, if we don't have a default, I would expect the average roc user to search the package registry for Map, Dict, or HashMap, and then just take the first package that has the most use overall. I don't think most users that want a Dict tend to think about the tradeoffs. They simply want a well implemented key value store.

I think adding a default helps to defend against people picking the wrong dictionary. If we provide a well documented default that has explicit documentation around when it won't work as well, that will help users switch to other dictionaries in those cases. So the documentation would look something like this:

  1. Almost always use a flat hash map. It is fast and has a nice memory footprint.
  2. If the hash map will always be small consider a flat map. It is just a list of key value pairs.
  3. If the value is large, consider storing the box of the value instead of the value itself.
  4. If the key is large, consider switching to a node hash map.
  5. If you are in a situation where hash flooding might happen due to targeted user input, consider x special hash map that uses a seed and randomizes the hashing algorithm for each new map (this is what go does to fight this)

Then, when a user goes to the default Dict, they will learn about hash flooding and have an alternatives to consider. Without this guidance, many users are even more likely to pick the wrong map. I think that a default with guidance is much more useful than no default at all.

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 03:56):

Also, sorry for constantly saying hash map instead of dict....just habit

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 03:58):

As an extra note, this makes me really question the supports Hashing ability. The issue is that for many specialized Dicts, you will want to use a specific hashing algorithm. So unless support Hashing takes a list of primitive hashing functions or something of that nature, you won't be able to swap out the algorithm.

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:59):

yeah that's how Rust does it https://doc.rust-lang.org/std/hash/index.html#traits

view this post on Zulip Richard Feldman (Mar 08 2022 at 03:59):

the Rust Hash trait is not algorithm-specific

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 03:59):

Also, If someone makes an HTTP library, I would hope that they can just expose Headers.get as you mentioned. Then they can pick whatever Dict or not that they see as suitable. End users would never have to think about it.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:00):

it just provides a way to specify "given a hashing function, here's a way to incorporate all my data into it"

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:01):

yeah, so maybe the answer is to have a roc/dict package that's a reasonable default, but also provide guidance to platform authors not to use it for HTTP headers :big_smile:

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:02):

Could a dict be parameterized on the hashing algorithm? I think so. If that is the case, then it may be possible to make a resistant dict pretty easily as an expansion on top of the standard dict. Though the paramterization may not be zero cost like in rust, so that may be problematic.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:05):

oh I think defending against hash flooding should be done at the hash map implementation level, not the hashing algorithm level

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:06):

I talked about my reasoning for that in this issue

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:07):

I guess it depends the level of defense you want/need. In that case, it could still be a different package. I would expect at least 4 different Dict implementations that are relatively common. linear list based key value parse, some sort of flat hash map, some sort of node hash map, and some sort of flooding resistant hash map.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:08):

also, the only flooding-resistant hashing algorithms I know of require side effects for randomness

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:08):

Just have a seed as part of their new function?

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:09):

I think that is enough

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:09):

oh sure, but then the Dict actually needs to store that, right?

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:10):

I guess you could hardcode it as part of the ability implementation and make sure to keep it secret from attackers :thinking:

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:10):

For sure, but if it is a user space library, it can either wrap a regular dict and store that, or it can make its own version and store that. Rather small cost.

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:10):

1 U64 per dict

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:11):

fair enough, but it's both more ergonomic and more performant to have the hash map itself implement the flooding mitigations :big_smile:

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:12):

I am perhaps unreasonably excited about that idea because it seems like a very clear-cut case where every mainstream language is doing something suboptimal in multiple ways, and I like the idea of doing it better and advancing the state of the art :smiley:

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:15):

Even most c++ standard libraries do this horribly. Default int hashing algorithm of do nothing.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:16):

whaaaaat

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:17):

Also, I am definitely concerned that we won't be able to implement a max performance Dict in roc itself. Most of them at some point or another depend on specific assembly instructions for bit twiddling and performance.

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:18):

But still I think this will be a fun project and interesting to see how far it can be pushed in Roc itself.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:20):

yeah I think that's probably right and I also think it's probably ok

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:20):

In which sense, Roc implementation performance will be good enough or that we can fallback to other things if it is too slow.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:21):

I suspect at the point where an absolute max performance dict is really a must-have, using a language like Roc at all is probably off the table anyway

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:22):

Oh, also the python hash function for ints is identity as well apparently.

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:22):

and I suspect we can be competitive with other languages with automatic memory management like Go, Java, C#

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:22):

even with a pure Roc implementation

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:22):

I think that all sounds accurate

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:22):

I'm definitely open to the idea that maybe we should have a Dict as a builtin for performance reasons, but I suspect it won't be necessary in practice :big_smile:

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:23):

and one big reason I like not having it as a builtin is that it means we don't have to do a language-level breaking change if we ever want to change the hashing algorithm

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:25):

because if there's one thing in programming that is an absolute unwavering certainty, it's that wherever there are hash maps, there are programs whose correctness unintentionally depends on the ordering that happens to be generated by their particular hashing functions

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:26):

also people keep coming out with new hashing functions that are faster than the old ones, so there's motivation to upgrade too :big_smile:

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:27):

Might be a good idea to always take a random seed so that iteration can have random order so that users can't depend on it.

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:27):

Except that they can feed a non-random seed....so nvm

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:28):

yeah Go apparently does something like that

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:28):

that would obviously break referential transparency though haha

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:29):

but yeah, I think it's much more okay for a package to say "here's a new version of this collection, it uses a new hashing algorithm; update to it only when you're ready" than for a language to do it

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:30):

because in the package case, you can always continue using the old version in any places that break with the new algorithm, and fix them (or not) at your leisure

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:30):

whereas at the language level, you're blocked on using the new language release until you can address the bugs from the changes, which is a much bigger concern!

view this post on Zulip Richard Feldman (Mar 08 2022 at 04:31):

that in turn creates an incentive for the language never to change its builtin hashing function, whereas packages can do it without that problem

view this post on Zulip Brendan Hansknecht (Mar 08 2022 at 04:33):

yep

view this post on Zulip Derek Gustafson (Mar 08 2022 at 11:19):

I believe the reason Python chose the identity for it's integer hash function is that the common use case for int keys is using a dict as an array, and they get better average performance that way

view this post on Zulip Derek Gustafson (Mar 08 2022 at 11:36):

I had a thought. What if we remove iteration from the hashmap API. We could have a Red Black map that provides the full API, and comments about memory usage and performance being better for hashmaps, at the cost of the API

view this post on Zulip Derek Gustafson (Mar 08 2022 at 22:29):

I've opened a WIP PR for HashMaps written in Roc. At the moment it's basically just the API. I would appreciate people taking a look.
Are there names that could be better?
Is the order of arguments consistent with Roc's philosophy on the piping operator?

view this post on Zulip Jared Cone (Mar 08 2022 at 22:56):

All the names make sense to me at least. I hesitated with len for hashmap. I can see it being used with List or Array because they are contiguous and linear, but stuck out to me in the context of a hashmap. Maybe I'm just used to Num or Count with them. But since List already establishes len I think it makes sense to stick with that.

view this post on Zulip Folkert de Vries (Mar 08 2022 at 22:56):

is not exposing things like toList really a solution? doesn't that take away much of the value of these data structures?

view this post on Zulip Folkert de Vries (Mar 08 2022 at 22:57):

people misusing the ordering will reap what they sow

view this post on Zulip Jared Cone (Mar 08 2022 at 22:58):

Maybe toList could take in a sort function as a parameter?

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:04):

@Jared Cone I took that name from Richard's documentation for an builtin Dict

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:06):

@Folkert de Vries Having watched the Python2/3 divide, I'm worried that creating a barrier to update (if the hash function is updated) is asking for a divide in Roc's userbase.

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:09):

@Folkert de Vries There might be an argument for a second Map package, OrderedMap. If you need the asymptotic performance use HashMap; If you want to be able to convert the Map to a List, use the OrderedMap.

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:10):

say I want to display this dict in some UI, what do I do?

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:12):

How big can your UI be? Do you really need the asymptotic performance?

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:12):

maybe the value comes from a package

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:14):

My gut reaction is that that means the package has a broken API

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:15):

maybe the package is for the general case where speed matters

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:16):

in any case, for a basic data structure, I want to be able to get my hands on the data again. That's the whole point

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:17):

btw rust fixes this by parameterizing over the hasher, which is a possibility here

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:19):

That's an interesting idea.

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:19):

Would that be a performance hit?

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:21):

The indirect call to the hash function?

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:21):

not when you use LLVM for optimizations

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:21):

we have no indirect calls anyway

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:22):

but it would introduce extra calls I think, which LLVM will inline away again

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:25):

another idea would be to name the data structure after its hashing function, e.g. call it FnvMap if it uses FNV-1a

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:25):

just to clarify, my idea is that you would have

# LowLevel.HashMap.roc

HashMap hasher key value : ...

empty : hasher -> HashMap hasker key value
empty = \hasher -> ...

# HashMap.roc

HashMap key value : LowLevel.HashMap.HashMap DefaultHasher key value

empty : HashMap key value
empty = LowLevel.HashMap.empty defaultHasher

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:25):

or WyMap if it uses wyhash

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:26):

but as a user, how do I pick which one I want?

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:26):

the answer to "I want to map keys to values with deduplication" should be simple

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:27):

"let's learn about hashing" isn't all that much better than "let's learn about monads"

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:27):

the answer to "I want to map keys to values with deduplication" should be simple

of course, a plausble answer to that is AssocList which doesn't even use hashing :wink:

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:29):

AssocList is weird because, yes it is the simple solution, but then it has horrible big-O complexity (which we all learn is bad), but then in practice the performance can actually be better than the structures with better big-O

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:29):

AssocArrayList would be neat

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:35):

I feel like this brings us back to what I was thinking. No toList for HashMap, but we could have an assocList with a map API for if you want to convert to a list.

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:37):

again I give you my example of you use some random package that gives you one of these closed data types, and you're just stuck.

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:38):

I mean, it's fine for an experiment for now, but that's not what I'd want the "chosen one" semi-official key-value map implementation to look like

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:46):

what's AssocArrayList?

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:47):

related: could something with an Ordering constraint be faster than hash maps for small collections and still reasonable (but slower than hash maps) for large ones?

view this post on Zulip Richard Feldman (Mar 08 2022 at 23:48):

Ordering never changes

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:48):

the idea is to store keys and values in separate arrays

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:48):

to speed up membership checks

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:48):

at the potential cost of 2 bounds checks instead of 1

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:49):

I think it could be reasonable for small keys

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:50):

where simd can just tear through the keys array

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:53):

RedBlackTries have O(log n) lookup and insert

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:55):

yes but you're chasing pointers on the heap. We've seen linear scans of an array of keys win for even very large sets/maps

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:55):

for 32-bit keys, that's important

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:56):

if your key is some more complex structure linear scan isn't as good, but still will win up to surprisingly large collections

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:56):

In order to write the HashMap in Roc, it was going to be a Hash Array mapped tree. Which would also require chasing pointers

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:56):

you can play with the branching factor though right?

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:56):

Yeah

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:57):

I was looking at Haskell's implementation. They claimed base 16, but I think it was actually 32

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:57):

and the size of the array is usually kinda large I believe, like 32 or 64 elements?

view this post on Zulip Derek Gustafson (Mar 08 2022 at 23:58):

I think so

view this post on Zulip Folkert de Vries (Mar 08 2022 at 23:58):

"large" is relative here, relative to the standard 1 of binary trees

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:25):

isn't a Relaxed Radix Balanced Tree similar to a HAMT except Ordering based?

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:27):

cc @Robin Heggelund Hansen

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:27):

I guess those are persistent data structures though, not mutate-in-place

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:36):

another thing to consider: something I've seen requested in other languages (but have never personally used) is a request to have iteration follow the order in which the entries were originally inserted

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:37):

I've heard when Python switched to guaranteeing this, there was no performance impact

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:38):

presumably the way they would have done that is to maintain a separate array of keys in insertion order

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:38):

I guess if they wanted to make removals O(1) they'd also need a separate hash map from keys to indices in the insertion order array :sweat_smile:

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:39):

but I suppose that's worth considering - if you have "hash map except it preserves insertion order," then it's strictly slower than an unordered hash map, but you can change the algorithm at will

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:40):

and the asymptotics can presumably be the same

view this post on Zulip Richard Feldman (Mar 09 2022 at 00:41):

plus you could still do the "associative array list at small sizes" thing, which would also maintain insertion order of course - and would presumably be faster than a small hash map

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 00:48):

Just a note, I am going to try and implement something quite similar to the absl::flat_hash_map directly in roc. I think it should be doable.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 00:53):

Derek Gustafson said:

Folkert de Vries Having watched the Python2/3 divide, I'm worried that creating a barrier to update (if the hash function is updated) is asking for a divide in Roc's userbase.

Personally, I don't think removing toList is the solution here. Packages are not part of the standard library and will have versioning. If a user depends on an implementation detail, they may get hurt when the package update. That is completely fine. The user can choose not to upgrade. Personally, I am debating using the version of the library as a seed for all hashing so that every update the order is randomize no matter that hashing function chosen.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 00:54):

Derek Gustafson said:

In order to write the HashMap in Roc, it was going to be a Hash Array mapped tree. Which would also require chasing pointers

Why? Why do you need any sort of tree structure?

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 00:57):

Richard Feldman said:

I've heard when Python switched to guaranteeing this, there was no performance impact

Python dict is really slow to begin with, so having no performance impact may be relative?

view this post on Zulip Derek Gustafson (Mar 09 2022 at 00:59):

Probably because I'm still trying to wrap my head around the mutable functional programming that Roc has created

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 01:00):

@Folkert de Vries Do you think hat taking the hasher would be optimized well in Roc. I am worried that unlike rust or C++ where it is compile time optimized away that in roc it would leave in a lot of switches and function calls. I assume the hasher would just be a struct of closures that know how to hash every primitive.

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:01):

@Derek Gustafson the way we've been designing everything (which has worked out so far!) is basically to assume everything is going to be mutated in place at runtime

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:01):

in other words, assume the optimization will kick in as often as possible

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:04):

like for example, assume that what List.set will do is mutate the given list in place and then return the mutated one

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:05):

and then be careful not to defeat that optimization unnecessarily, by (for example) storing the same list in two places, causing it to have a reference count greater than 1

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:07):

but even if that does happen, it may just get cloned once (by List.set for example, which will see that it has a refcount > 1 and therefore ineligible for in-place mutation) and then after the cloning, you end up with two distinct lists, one of which has been mutated, and both of which now have a refcount of 1

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:07):

so subsequent operations on the same list can be done in-place

view this post on Zulip Derek Gustafson (Mar 09 2022 at 01:09):

@Richard Feldman Yeah, I know. That's what makes me so excited for Roc. But the intuition isn't there yet

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 01:11):

Also, do we have anyway to make a default or dummy element with making a tag union and wasting space? For List backed dictionaries, they will have plenty of empty slots that require some sort of value.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 01:11):

I don't necessarily want a user to have to pass in a default element that we store.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 01:14):

My current idea to avoid this is to just duplicate existing items in order to avoid need a default.

view this post on Zulip Derek Gustafson (Mar 09 2022 at 01:36):

@Richard Feldman Just took a look at Python's Dict implementation. They have a hash table of indices into an array of key/value pairs.

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:52):

interesting! To Brendan's point earlier, I wonder how they store deleted entries :thinking:

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:54):

clearly they aren't just iterating directly over the k/v pair array, since deletions may have removed entries :big_smile:

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:56):

actually come to think of it, we do already have an optimization that stores tag union discriminants in pointer bits if possible

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:56):

so if you had a key or value that had a pointer in it, I think just wrapping it in [ Entry key value, Removed ] would actually not take up any more space

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:57):

but it would need extra space if both key and value were integers, for example

view this post on Zulip Richard Feldman (Mar 09 2022 at 01:57):

a hash table of indices into an array of key/value pairs

double pointer dereference for each lookup is not ideal performance-wise :sweat_smile:

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:00):

Considering what goes into executing a single bytecode in Python, I don't think double dereferncing is really a concern

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:03):

I wonder what we'd have to do to avoid that

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:04):

while still being able to iterate in insertion order

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:06):

Just dug deeper. The use Null pointers for deleted entries, and skip over them when iterating.

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:09):

https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:09):

Hash map with a doubly linked list for ordering

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:09):

Roc can't handle a doubly linked list, can it?

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:17):

nope haha

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:17):

not expressible!

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:48):

another important consideration about AssocList and friends: a common use for dictionaries is to have string keys

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:49):

if they're small strings, that might not be so bad, but if they're large strings (e.g. URLs and email addresses are often over 24B) then there'll be lots of pointer dereferencing just to complete the equality checks :grimacing:

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:49):

a plain hashmap would do much better than those!

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:50):

Can we do anything with string interning?

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:53):

:thinking: we've never talked about it

view this post on Zulip Richard Feldman (Mar 09 2022 at 02:54):

although given the complexity that would introduce to a lot of other use cases, I'd default to thinking that string keys are a point in favor of an ordinary hashmap :big_smile:

view this post on Zulip Derek Gustafson (Mar 09 2022 at 02:58):

So, back to what started this whole discussion today: Should there be a toList function? If so, what's it's order?

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:08):

I would vote for yes and random order, but in general, it is something that can be worried about later. Because much more important of the library can not be built in roc and is written is zig or directly in ir.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:09):

interesting! To Brendan's point earlier, I wonder how they store deleted entries :thinking:

Separate list that has a bit to spare and helps make lookups faster by reducing pointer dereferencing.

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:09):

I assume we're talking about pure Roc implementations here, yeah?

view this post on Zulip Derek Gustafson (Mar 09 2022 at 03:10):

That's what I've been talking about

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:11):

For the deleted entries comment, I am talking about absl::flat_hash_map

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:11):

Essentially whenever I talk about specific implementation I tend to base off of absl::flat_hash_map, but when I talk about design, I am talking about the theoretical roc version.

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:12):

personally my current thinking on toList is:

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:13):

and because of that last point, the advice would be "try not to rely on the ordering, because if you do, it'll make upgrading painful in the future"

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:13):

Also, my biggest issue with the python dict is that there hash functions are terrible which naturally leads to more collisions which slows things down even more.

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:22):

relatedly: we should look into porting a hashing algorithm to Roc!

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:22):

that may reveal missing builtins

view this post on Zulip Richard Feldman (Mar 09 2022 at 03:23):

e.g. I noticed wyhash calls count_ones, which we don't currently have a builtin for

view this post on Zulip Jared Cone (Mar 09 2022 at 03:25):

Would any runtime operations on the dict change the ordering? Like, dict with (a,b) toList produces [a,b]. But then you add c and now toList produces [b,a,c]

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:25):

Yes

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:25):

A lot of dicts change order when they resize

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:25):

Because they rehash and modulus everything.

view this post on Zulip Jared Cone (Mar 09 2022 at 03:33):

Yea I would think that would be the more prevalent issue that people would run into as opposed to depending on ordering being stable between versions

view this post on Zulip Jared Cone (Mar 09 2022 at 03:35):

At least I know my team has had bugs with that

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:38):

This makes me think of how for debug builds and tests, google now randomize the order of elements that have the same value when sorting. It stops people from depending on it and enables them to upgrade the algorithm. Of course, you can specifically call a stable sort method if you need that behavior, but if you don't specify needing the behavior, they don't want you to depend on implementation details of the sort.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 03:39):

I really wish this was possible to do in roc, but the only way to do it would be to take a random seed...which easily could be set to a non-random value.

view this post on Zulip Robin Heggelund Hansen (Mar 09 2022 at 08:04):

There's much to reply to here, I'll do my best to catch everything.

RRBT, HAMTs and AMT

RRBT is not a hash-map. It's an Array implementation. It's benefits over AMT is that it retains amortized constant time operations for operations that manipulate the left/front of the Array as well as concatination. An AMT implementation (like the one in Elm) effectively needs rebuilding the tree for such operations. It does come at a slight bit of overhead though, but it shouldn't be to terrible.

While RRBT, HAMT and AMT are today known as immutable collections, they weren't designed that way from the offset. Phillip Bagwell (inventor of HAMT and co-inventor of RRBT) initially designed the HAMT as a mutate-in-place datastructure which used a tree structure for easier scalability (didn't require rebalancing, and so didn't have the same worst-time memory requirements).

Besides, any batch operations on such data structures (like map, or batch insert) would take advantage of mutability like what you've got in Roc.

I already made an Elm HashMap implementation, based on HAMT but which retains insertion order: https://github.com/elm-explorations/hashmap

The extra book-keeping to retain insertion order has some overhead, but it's still faster than Elm's Left Leaning RRBT.

Python's insert-order dict

Python retains insertion order at no performance expense _compared to Python's earlier implementation_, which is an important point to understand.

The way deletions are implemented is that you insert a tombstone for a particular key. A tombstone is just an object that can easily be recognized as a tombstone, and whenever a tombstone is encountered, the implementation acts if it found no value. Easy to do without performance overhead in a dynamic language, harder in ML. Though I guess Roc's tags might help with that?

Anyway, at some point, the hashmap has to be rebuilt (either to accommodate more keys, or simply to prevent an overflow of tombstones), at which point the tombstones are filtered out. A regular HAMT, in comparison, doesn't require rebuilding at set intervals.

The same idea is used in my Elm-based hashmap implementation, link above.

Hashes and orderings

I belive the concern against relying on hashed-ordering is that it could change from one release of the std library to the next. I'd just like to say that this happens _very_ rarely in practice. Most languages don't change the hash that often, and new hashes don't show up that often either. It's an option to simply commit to a given hash-implementation, which many languages (like Java) do. Although, most languages also allow you to change the hash implementation being used :shrug:‍♂️

Efficient ordering-based map

The consensus right now is that you can't really do much better than an RRBT. _However_ in a language like Roc where you're not storing everything as a pointer, it might be worth considering just going for a 2-3 or 2-3-4 tree. It's more code, but less complexity, and you should be able to store 2-3 non-heap key-value pairs per node if you've got monomorphization. Doubt it'll be faster than a HAMT, though.

view this post on Zulip Kevin Gillette (Mar 09 2022 at 14:01):

@Richard Feldman I wonder if the naming of FnvMap and WyMap run counter to the goal of a high level language with tight performance: in my mind, that means I write very high level code and it happens to perform quite well (not necessarily "fastest ever," but fast enough that performance would not be a factor in choosing against Roc except in very niche cases). Even as a seasoned engineer, seeing FnvMap and WyMap as the choices means I need to read wikipedia and blog posts specifically comparing the two before I can make a decision, but I'd much rather just have some expert pick the generally right choice, give it a general name, and provide some loose guidance on "if your workload looks like X then consider choosing Y instead."

Beginner-friendliness might be something a clean language design provides "for free" without necessarily being an explicit goal, but echoing @Brendan Hansknecht, I believe that lacking a clear, good enough default for application authors to choose would make the language less usable and less approachable.

fwiw, having some kind of key-val mapping concept (whether or not it's a hashmap) is a fundamental need in programming, and a language feels hollow (to me) if it doesn't have a great implementation in the stdlib if not the language itself. Yeah, we should encourage people to use records when records fit the need, which is most of the time, but there's already precedence for a "here be dragons" syntax with sets in Roc, which use a list-like but not list-identical literal syntax. I'm certainly open to the argument that "any use of map literals should be using records instead," but there are regularly cases where one needs to initialize a mapping with a mix of static and dynamic values: requiring that to be a sequence of Map.add calls can obfuscate behind an algorithm what could have been data-declarative code.

view this post on Zulip Kevin Gillette (Mar 09 2022 at 14:29):

Does referential transparency mean "across time and space," or is that a related but distinct concept? I've taken it to just mean within the confines of a single process in practice, and that behavioral changes within different processes of the same binary, or across different binaries, while not ideal, do not technically break the property.

We could have platforms, such as an HTTP server platform, pass an opaque request context type to handler functions, such that the opaque type could be exchanged for a seed (thus not relying on the app author to generate their own seed, though still relying on them to not pass 0 as the seed). We could also have the platform provide a higher order data-initializer function(s), like:

init : PlatformOpaque -> (Seed -> a) -> a

Called like pf.init req Hash.new where req is the opaque type given by the platform which internally contains a seed.

In terms of flakey tests vs random seeds, we could have some test running mode that knows about seed-taking (or implicitly seed) types, and either runs it with a fixed seed for debugging, or runs the test (or relevant portions of the test, because referential transparency!) multiple times, or using test-compiled instrumentation (as context-aware fuzzers do) to ensure that those flakes are exposed on every test invocation rather than only occasionally. Combined with test result caching (don't rerun the test if nothing about the code has changed), this seems pretty solvable.

In Go's case, it appears to work pretty well, since the community doesn't regularly complain about flakey tests due to map iteration order, suggesting that issues are reliably uncovered very quickly (I've been in the Go community for over 10 years, so that's not a from-the-outside-in observation). I suspect giving Roc mappings an implicit seed would not lead to flakey tests in practice, especially with some degree of test support. That would also save us from needing to choose data structures or algorithms that guarantee insertion-order iteration or some other meaningful order.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 16:46):

I wonder if the naming of FnvMap and WyMap run counter to the goal of a high level language with tight performance: in my mind

I think that naming the map is much worse than that. To maximize performance you likely need multiple types of maps as well. So it would be WyFlatMap and FvnFlatMap and WyNodeMap and ....

I believe that lacking a clear, good enough default for application authors to choose would make the language less usable and less approachable

I 100% agree. Even though I am advocating for multiple map implementations, I think that basically all users should always be using something that is similar to absl::flat_hash_map. I just think that if it is written in Roc, we should definitely document alternatives/cases where you would want to pick a different map or hash function. I think that 95+% of users would just use the default map and be happy, but when someone hits and edge case, it tends to really matter.

there's already precedence for a "here be dragons" syntax with sets in Roc

Yeah, initialization is definitely nicer with full builtins. Though I guess it isn't much worse than Dict.fromList [ T k1 v1, T k2 v2, T k3 v3 ] or Set.fromList [ k1, k2, k3 ]

requiring that to be a sequence of Map.add calls can obfuscate behind an algorithm what could have been data-declarative code

100% Agree though Map.add is not totally horrible:

    dict = Map.empty {}
        |> Map.add "test" 124
        |> Map.add "foo" 7
        |> Map.add "bar" 12

within the confines of a single process

This is what I have seen most hash maps do. It requires special fingerprinting algorithms if you need to read and write to disk.

and that behavioral changes within different processes of the same binary, or across different binaries, while not ideal, do not technically break the property.

This is a hard one. I think it is technically true, but it is hard for a lot of people when expecting certain things from a pure functional language. They generally expect if they compile the same app twice with the same version of compiler and libraries, they will get the same result. That is no longer true.

In terms of flakey tests vs random seeds ... runs it with a fixed seed for debugging

I would highly advise against a fix seed. People will just be confuse why their test passes but their actual code fails.

In Go's case, it appears to work pretty well, since the community doesn't regularly complain about flakey tests due to map iteration order, suggesting that issues are reliably uncovered very quickly (I've been in the Go community for over 10 years, so that's not a from-the-outside-in observation). I suspect giving Roc mappings an implicit seed would not lead to flakey tests in practice, especially with some degree of test support.

I used to use Go a lot, so I totally understand what you mean. I think that if Roc always gave random seed to maps and it was well known, that would fix the issues. Of course, it will probably also anger a lot of functional programmers because it feels very impure. I think it is just important that a implicit seed would need to be random and should not be some sort of static constant. Also, the reason that a random seed doesn't lead to flaky tests is that it is extremely unlikely to produce the same hardcode result twice. So the test will fail way more than it succeeds and stick out like a sore thumb. As an extra bonus, random seeds help to mitigate hash flooding.

We could also have the platform provide a higher order data-initializer function

Jumping back to this point now because I think it might have more complexities to talk about. A few first thoughts:

As an extra note, we could theoretically have a Roc constant like Num.appSeed : Seed. Where Seed := Nat and the seed only has a get method. Then anything that just needs a random initializer could take in Seed and use it. The appSeed is a constant that is random but generated at compile time (or maybe app startup time). For anything that needs more randomness, it then would still depend on getting it from the platform.

view this post on Zulip Kevin Gillette (Mar 09 2022 at 19:29):

I definitely agree that there is room and use for multiple map types, but we should be able to engineer a pretty good default choice, as you say, for the 95% case.

I would highly advise against a fix seed. People will just be confuse why their test passes but their actual code fails.

I meant as an opt-in debugging mode, not the default.

If Seed is defined in the hash map library (or rand or etc) and is an opaque type, the platform would have no way to construct a seed that wasn't also available to the user. So if the platform is really just calling Map.new (Seed.from 1234). The end user can also make the exact same call and remove the randomness.

I was actually thinking Seed might just be another name for U64 (or U64 itself), and transparent, but that the platform would provide such a convenient way to get a seed that you'd just use the platform.

view this post on Zulip Brendan Hansknecht (Mar 09 2022 at 22:48):

I was actually thinking Seed might just be another name for U64 (or U64 itself), and transparent, but that the platform would provide such a convenient way to get a seed that you'd just use the platform.

I see. I guess that also works. It just means that anyone can pass a constant seed to the mapand then expect it to always have the same order, but if we update the hash map in some way, that may not be true. But if it was codified in most platforms, people should definitely see it coming when it randomly breaks if they use a constant seed over the platform generated one.

view this post on Zulip Richard Feldman (Mar 10 2022 at 03:43):

https://docs.rs/indexmap/latest/indexmap/map/struct.IndexMap.html looks very promising! I wonder what the internal representation looks like

view this post on Zulip Richard Feldman (Mar 10 2022 at 03:44):

the swap_remove is very clever

view this post on Zulip Brendan Hansknecht (Mar 10 2022 at 03:48):

It looks to be a hashmap to int (index) and a vector of key-value pairs.

view this post on Zulip Brendan Hansknecht (Mar 10 2022 at 03:49):

Otherwise, it looks to use hashbrown internally, so I think that it just adds an extra layer of indirection to make the ordering work, but this is on a super quick skim and could be off.

view this post on Zulip Brendan Hansknecht (Mar 10 2022 at 03:50):

Internal data structure

view this post on Zulip Robin Heggelund Hansen (Mar 10 2022 at 12:50):

yikes. Under the "efficient ordering-based dict" i mention RRBT. I meant Red Black Tree :sweat_smile:

view this post on Zulip Robin Heggelund Hansen (Mar 10 2022 at 12:51):

left leaning RRBT should be Left Leaning Red Black Tree as well :sweat_smile:

view this post on Zulip Richard Feldman (Mar 11 2022 at 19:25):

here's a proposal for builtin Dict and Set types which would remember insertion order: https://docs.google.com/document/d/1RtffKIwVZut4g8hniop8veuMeQYfYeGZofT-mXvrQZQ/edit?usp=sharing

feedback welcome!

view this post on Zulip Brendan Hansknecht (Mar 11 2022 at 19:36):

Random thought. With abilities, we could make Dict.* work on any dictionary, including those built in roc, right?

That could greatly streamline use if we can make other Dicts a drop in replacement with the exception of a few functions. Though I guess you could also just use an alias so maybe that doesn't matter too much.

view this post on Zulip Martin Stewart (Mar 11 2022 at 20:05):

Richard Feldman said:

here's a proposal for builtin Dict and Set types which would remember insertion order: https://docs.google.com/document/d/1RtffKIwVZut4g8hniop8veuMeQYfYeGZofT-mXvrQZQ/edit?usp=sharing

feedback welcome!

If the Dict preserves insertion order then dict1 == dict2 will be false if dict1 and dict2 to have the same items but in a different order? If yes, that seems like a bit of a gotcha. That said, I've used AssocList a lot in Elm which has the same issue and in practice it hasn't been too much of an issue.

view this post on Zulip Brendan Hansknecht (Mar 11 2022 at 20:08):

That is probably a much worse issue with sets

view this post on Zulip Brendan Hansknecht (Mar 11 2022 at 20:08):

hmm... good gotcha to know about

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:13):

dict1 == dict2 should determine its answer independently of insertion order, meaning it would likely be pretty slow if both dictionaries had the same length :stuck_out_tongue:

(that siad, dict1 == dict2 is not something I expet people to be be doing often outside tests!)

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:13):

but that's true of hashmaps in general because hash collision resolution depends on insertion order

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:14):

so that question is decoupled from this proposal I think!

view this post on Zulip Martin Stewart (Mar 11 2022 at 20:18):

I think it's typical to expect that if dict1 == dict2 is true, then someFunction dict1 == someFunction dict2 is also true? If Dict.toList sorts based on hash then this invariant would indeed hold.

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:19):

huh, that's an interesting point!

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:19):

I hadn't considered that

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:20):

I agree that it makes intuitive sense

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:20):

however, I could see it being an inconvenience for tests

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:21):

e.g. I want to say user1 == user2 in my test, and because User has a dictionary in it somewhere, and they were both built up differently, now I have to make my test dramatically different :grimacing:

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:22):

:thinking: is (val1 == val2) == (someFunction val1 == someFunction val2) otherwise true in Roc for all other types besides dictionaries?

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:22):

floats come to mind, but let's assume NaN is not a consideration here

view this post on Zulip Martin Stewart (Mar 11 2022 at 20:23):

I think it holds for everything else. It's almost synonymous with referential equality?

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:25):

heh, I guess an upside of making dict1 == dict2 compare based on insertion order would be that it would run faster :big_smile:

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:26):

could always add a Dict.sameContents : Dict k v, Dict k v -> Bool which ignores insertion order, although it probably wouldn't be as convenient in tests

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:29):

separate question: what if the names of the two remove functions were remove and removeOrdered, thus nudging you to use the faster one by default?

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:30):

I think in the uncommon case where people really cared about order, someone could make a third-party package called OrderedDict which was an opaque wrapper around Dict except its remove function called removeOrdered under the hood

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:30):

so that you'd have the guarantee that it never disturbed the order even accidentally, because the original remove was no longer in the API

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:30):

and there'd be no runtime overhead since it would be an opaque wrapper around the original

view this post on Zulip Brendan Hansknecht (Mar 11 2022 at 20:34):

removeOrdered doesn't sound like it will necessarily take a long time, but I definitely think removeSwap should just be remove

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:40):

removeShift maybe?

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:46):

I think it holds for everything else. It's almost synonymous with referential equality?

worth noting, we can't guarantee this across the language. People can absolutely make custom equality implementations for their custom collections which don't obey this.

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:46):

so no matter what, it's not safe for people to rely on it in the general case

view this post on Zulip Brendan Hansknecht (Mar 11 2022 at 20:53):

If a user makes a custom equality method, would it work with ==

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:54):

yep!

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:55):

== desugars to isEq

view this post on Zulip Derek Gustafson (Mar 11 2022 at 20:55):

Isn't that the point of the Eq Ability?

view this post on Zulip Richard Feldman (Mar 11 2022 at 20:55):

yeah :check:

view this post on Zulip Martin Stewart (Mar 11 2022 at 20:59):

Oh, well if there's custom equality then
Richard Feldman said:

:thinking: is (val1 == val2) == (someFunction val1 == someFunction val2) otherwise true in Roc for all other types besides dictionaries?

doesn't hold

view this post on Zulip Richard Feldman (Mar 11 2022 at 21:02):

well it's true for other types in the stdlib

view this post on Zulip Richard Feldman (Mar 11 2022 at 21:02):

and seems like something that probably ought to be encouraged, even if it can't be a guarantee

view this post on Zulip Richard Feldman (Mar 11 2022 at 21:02):

another thing that occurs to me is that if == checks for contents only, it could lead to bugs in tests

view this post on Zulip Richard Feldman (Mar 11 2022 at 21:02):

or rather, bugs getting missed by tests

view this post on Zulip Richard Feldman (Mar 11 2022 at 21:03):

because maybe you have code that relies on the ordering, and the test says "oh yeah these are ==, everything works fine" and then the ordering changes and the test misses it, and something breaks elsewhere

view this post on Zulip Derek Gustafson (Mar 11 2022 at 22:52):

Richard Feldman said:

here's a proposal for builtin Dict and Set types which would remember insertion order: https://docs.google.com/document/d/1RtffKIwVZut4g8hniop8veuMeQYfYeGZofT-mXvrQZQ/edit?usp=sharing

feedback welcome!

Just finished reading this. I like the idea of using IndexMap.

view this post on Zulip Derek Gustafson (Mar 11 2022 at 22:52):

But, I question the assumption that it will be significantly faster to be written in Rust

view this post on Zulip Derek Gustafson (Mar 11 2022 at 22:54):

With what @Brendan Hansknecht is already able to do, performance wise, for a hash function. And the light it shines on where we can improve the optimizations, I think there's a strong chance that a Roc based hashmap could be competitive.

view this post on Zulip Richard Feldman (Mar 11 2022 at 22:58):

I question the assumption that it will be significantly faster to be written in Rust

you mean as opposed to having it be written in pure Roc?

view this post on Zulip Derek Gustafson (Mar 11 2022 at 22:58):

Yes

view this post on Zulip Richard Feldman (Mar 11 2022 at 22:58):

it's conceivable, but if it's a builtin...why not? :big_smile:

view this post on Zulip Richard Feldman (Mar 11 2022 at 22:58):

nobody will be able to tell the difference anyway!

view this post on Zulip Derek Gustafson (Mar 11 2022 at 23:04):

From a practical point of view, you're right.
From a philisophical point of view, I feel like being able to say "Our standard library is written in out language" is a selling point

view this post on Zulip Derek Gustafson (Mar 11 2022 at 23:05):

I like to be able to look at a language's standard library to get an idea of what code should look like

view this post on Zulip Richard Feldman (Mar 11 2022 at 23:05):

I hear that, although I made a decision awhile ago that when the choice is between the stdlib going faster and being in Roc, the priority is for it to go faster :big_smile:

view this post on Zulip Derek Gustafson (Mar 11 2022 at 23:06):

Okay

view this post on Zulip Kevin Gillette (Mar 12 2022 at 04:50):

Martin Stewart said:

If the Dict preserves insertion order then dict1 == dict2 will be false if dict1 and dict2 to have the same items but in a different order? If yes, that seems like a bit of a gotcha. That said, I've used AssocList a lot in Elm which has the same issue and in practice it hasn't been too much of an issue.

This could be pretty surprising. I would indeed expect that if dict1 == dict2, I would expect to be able to get the same iteration order out of either (thus they really _should_ be equal only if they have the same iteration order), yet it's equally surprising for the iteration order of a dictionary to have any bearing on equality.

I suppose this is one of the reasons that so many languages define dictionaries to have undefined or otherwise unordered iteration (so as to not commit to it being a property influencing equality), yet I also see how this is particularly important for FP.

view this post on Zulip Kevin Gillette (Mar 12 2022 at 04:51):

Richard Feldman said:

(that siad, dict1 == dict2 is not something I expet people to be be doing often outside tests!)

Are we sure dict equality won't be used outside of tests?

Anything for which equality is defined (which the language sufficiently understands) _can_ be used as mapping key (if every value has a single canonical representation, I suspect it is also hashable). This means that dictionaries and sets can themselves be keys (or set elements). This could be important to generic memoization, where the input may be a dictionary or set.

view this post on Zulip Kevin Gillette (Mar 12 2022 at 05:32):

Richard Feldman said:

dict1 == dict2 should determine its answer independently of insertion order, meaning it would likely be pretty slow if both dictionaries had the same length :stuck_out_tongue:

iiuc, when two dictionaries share the same hash function (and perhaps same number of buckets/nodes), you can do a bucket-wise comparison: if any bucket has a different length than its pair in the other dictionary, then the two dictionaries are not-equal. If any kind deterministic ordering can be maintained between elements in a bucket, then elements within paired buckets can be pairwise-compared efficiently, and even in parallel across buckets. Maintaining intra-bucket ordering would be a heavier cost during insert, but could unlock efficient whole-of-set operations such as intersection, difference, etc.

view this post on Zulip Kevin Gillette (Mar 12 2022 at 05:37):

Richard Feldman said:

separate question: what if the names of the two remove functions were remove and removeOrdered, thus nudging you to use the faster one by default?

I think in the uncommon case where people really cared about order, someone could make a third-party package called OrderedDict which was an opaque wrapper around Dict except its remove function called removeOrdered under the hood

That seems apt. There are certainly use-cases for specifically ordered dictionaries, but ordering for a hypothetical Roc dictionary just seems to be an incidental outcome of seeking determinism.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 06:00):

If two dictionaries share the same implementation, hash function, and capacity, you should be able to do bucket wise comparison. That is a lot of assumption though.

Also, in some maps it may not be possible to get an individual bucket. absl::flat_hash_map, for example, can not tell one bucket from the next without extra calculations in many cases. So you will get a group of buckets with interleaved elements as opposed to a single bucket.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 06:03):

Could theoretically get the entire map in one chunk as if it was a single bucket, but that is mathematically unlikely.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 06:03):

Most modern fast hash maps are quite similar internally and would see similar issues.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 06:09):

I think that if equality doesn't depend on insertion order, we should not make the iteration depend on insertion order. Whatever we pick here, I feel it would be best to be consistent.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 06:23):

As an extra note, even haskell allows for dictionaries with undefined iteration order.

Sure someone could use one to generate a seed for a random number generator, but I don't think the goal should be to develop for the pessimistic case where the user is trying to game the language.

view this post on Zulip Richard Feldman (Mar 12 2022 at 17:48):

I wonder why IndexMap stores the actual hash value alongside each entry :thinking:

https://github.com/bluss/indexmap/blob/921df7b9ecdaa15203180d2892064c21c62db01a/src/lib.rs#L129

view this post on Zulip Richard Feldman (Mar 12 2022 at 17:49):

if it didn't, then toList could be essentially free

view this post on Zulip Richard Feldman (Mar 12 2022 at 17:49):

because you'd already have all the key-value pairs in a contiguous chunk of heap memory

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 17:59):

It looks like they use it to update the indices hash table without having to recompute the hash.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 18:00):

Probably only really important if you are worried about large keys

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 18:00):

But also, I guess they do update the indices on every remove

view this post on Zulip Richard Feldman (Mar 12 2022 at 18:14):

ah interesting

view this post on Zulip Richard Feldman (Mar 12 2022 at 18:15):

so you could also drop the field it altogether, with the tradeoff being:

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 18:36):

And removeCompact has to rehash a lot of things.

view this post on Zulip Richard Feldman (Mar 12 2022 at 18:36):

yikes, that's true

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 18:37):

Which is super fast in the case of small keys and probably not a big deal, but really bad for large keys.

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:22):

here's an idea for a modification of the IndexMap design: instead of having remove be "remove and swap," instead have it be "remove by leaving a tombstone"

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:26):

I think that would have these tradeoffs:

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:27):

having remove preserve order would definitely make it less surprising to use, which I like, and having it be a bit faster seems nice too

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:27):

assuming that seems like a good idea, it would raise the question of how to represent a tombstone

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:28):

one idea is to use the HashValue that's already stored next to the key and the value

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:29):

have a HashValue of 0 represent a tombstone, which means we have to disallow an actual hash value of 0. That can be done cheaply by using (hash + (hash == 0)) instead of hash whenever we compute a hash value

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:29):

so that would mean an extremely slight increase in chance of collisions, because anything that hashed to 0 would instead hash to 1

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:30):

but it would only cost an addition and a compare to zero, and would only come up when actually computing a hash - so basically insert, get, and remove only

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:31):

really the thing that I like most about it is that it seems like having remove not preserve order when literally every other operation does seems like a thing that would be a source of confusion

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:31):

maybe bugs, maybe not, but certainly a lot of "wait why is it that way?" questions

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:32):

seems like it would just feel a lot nicer to use if it Just Worked while preserving both order and asymptotics

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:32):

plus then there'd be no real need for a separate OrderedDict wrapper data structure, since just using it normally would get you a completely ordered one

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:33):

also wouldn't make it as weird to explain why equality preserved order if the design of remove (assuming it was essentially removeSwap) encourages that you treat it like an unordered map even though it's ordered

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 21:37):

Richard Feldman said:

have a HashValue of 0 represent a tombstone, which means we have to disallow an actual hash value of 0. That can be done cheaply by using (hash + (hash == 0)) instead of hash whenever we compute a hash value

There are hash functions that already have guarantee like this so you don't need to explicitly deal with it. I think wyhash might have a tombstone or 2. If not slightly modified versions of it should.

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:38):

ooo, even better!

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 21:40):

So with this change, wouldn't every insert be O(n) to scan the list for tombstone? If not, tombstones will just collect and we will waste tons of memory, right?

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:41):

this is just in the ordering array

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:41):

oh, sorry

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:41):

If not, tombstones will just collect and we will waste tons of memory, right?

my thinking is that they'd collect, but as soon as we run out of capacity and need to grow the array, we'd compact them when copying them into the new allocation

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:42):

so that might happen more often than necessary, but that only happens if you're doing a bunch of removals followed by a bunch of insertions. If you know that's your use case, you can explicitly opt into removeSwap if you don't care about order, or if you do care about order, this is still better than removeShift because the shifting only happens occasionally instead of every single time

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:44):

we could also have a heuristic like "if we only used X% of our capacity before running out, just compact in-place and don't make a new allocation"

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:45):

so in that case it would be sort of like "lazily removeShift when we run out of space only"

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:46):

although that might be a rare enough situation that maybe it's not worth storing the extra metadata to make it happen :thinking:

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:46):

and instead just suggest using removeShift explicitly

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:52):

basically it seems like it would fit the "good default choice for a dictionary" criteria

view this post on Zulip Richard Feldman (Mar 12 2022 at 21:53):

if you end up with tombstones taking up too much memory, the performance optimization to fix it is easy: change from remove to either removeSwap or removeShift depending on whether you care about ordering

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 21:59):

my thinking is that they'd collect, but as soon as we run out of capacity and need to grow the array, we'd compact them when copying them into the new allocation

The index map array is not tied to any capacity. Though I guess you could tie it to the capacity of the index hashmap.

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:00):

Also, this will make iteration take O(capacity) time.

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:01):

right, same as an unordered hashmap :thumbs_up:

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:02):

which seems fine to me

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:02):

depends the implementation, but yes, that is generally the case.

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:02):

although actually in practice it will most likely have to traverse less than an unordered hashmap

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:02):

maybe a better way to say that: worst-case is same as unordered hashmap, but typical case is likely to be better :big_smile:

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:04):

separately, I realized since we monomorphize, we could actually completely swap out the implementation for certain cases where we know it can be a lot faster, e.g. when the keys are integers

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:06):

One other thing I realized that might be the worst case for this change. The size of the data array is not at all tied to the capacity. If I delete and re-add the same element 1,000,000 times. I would get 1,000,000 tombstones but the capacity would never change of the index map.

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:07):

oh, so I'm assuming that we have the same capacity value for both the indices and the data array

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:08):

in fact we'd probably want to have one heap allocation for both, just partition it into (refcount, data array, indices)

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:10):

but I think the real key here is that if you're doing a million removals and a million re-adds of the same element, it's probably time to consider using either removeSwap or removeShift instead of remove :laughing:

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:11):

hmm. If you did that, I guess that means you would get a lot of compactions. It would fill up the capacity of the data array with tombstones then compact and then repeat. So would just group compaction to whenever the capacity was full of tombstones.

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:11):

right

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:11):

Also, removing and reinserting an element many times is not at all uncommon in hot loops of code.

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:11):

that might actually be better than removeShift haha :sweat_smile:

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:12):

Yeah, it should be a grouped removeShift

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:12):

well, if you're removing and re-inserting in a hot loop, I certainly hope you don't care about insertion order :stuck_out_tongue:

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:12):

Sure, but we are making the dict care about insertion order by default even if the user doesn't care

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:12):

at which point removeSwap should be fine and have none of these problems

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:13):

Yeah, but I thought the point of this change was to remove the need for users to know about removeSwap and removeShift

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:14):

oh, no - not to have them not need to know about it, more to have the default path work intuitively and have good performance in common use cases

view this post on Zulip Brendan Hansknecht (Mar 12 2022 at 22:14):

Then I would just default to removeSwap. The common case is definitely not depending on the insertion order of a dictionary

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:16):

hm

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:16):

I guess we could try it that way and see if people find it confusing in practice :thinking:

view this post on Zulip Richard Feldman (Mar 12 2022 at 22:16):

and then reconsider the tombstone idea if it's a problem

view this post on Zulip Kevin Gillette (Mar 13 2022 at 00:14):

Perhaps the ordering strategy can just be declared at dictionary initialization time, if one ordering or another is important beyond just guaranteeing determinism of some kind? It seems more valuable to me to have a whole-of-dictionary policy rather than choosing removeShift in this case, then removeSwap immediately after that (when and why would such flexibility be needed)?

If removeSwap only exists for performance, and, were removeShift somehow had the same performance, it doesn't seem like there'd be any reason for removeSwap (we've used phrases like "that's clever" or "it'll be faster," but not "that's actually what I semantically want anyways"). With tombstoning, which sounds promising, it sounds like we're exploring ways to achieve reasonable performance without sacrificing semantics.

An aside: there are likely cases we can easily optimize, such as when a dictionary is proven not to ever be iterated upon (only insertions, deletions, or keyed lookups are used), in which case we don't even need IndexMap, or if we use it anyways (such as for better icache performance), we can use removeSwap for all removals, because order will never be concerned. This decision can also change at function boundaries: if I define a function which accepts a dictionary, I generally don't get to have assumptions about the ordering of items in my input (because if I made such assumptions, the caller could just initialize the dictionary in a different order anyways), thus if the caller doesn't use iteration, then the caller's deletions can use removeSwap, even if my own function uses iteration after deletion, thus my own deletions use removeShift or tombstoning implicitly.

With tomestones, the "reached capacity, thus do a tombstone removal sweep" will likely lead to measurable latency spikes. We can instead use a strategy of making any update pay for a partial tombstone cleanup, such that we maintain an offset in the dictionary indicating the index of the first tombstone (which is the length of the array when there are no tombstones):

  1. Whenever the offset is equal to the length of the array, the update pays no cost, otherwise...
  2. If an insertion occurs, first pay a fixed cost of n element shifts (where n is probably 2 or 3), so that the first tombstone migrates towards the end of the array (eventually becoming capacity), and so that runs of tombstones aggregate together even if starting as spaced-apart tombstones; also update the offset to account for the new location of the first tombstone (if the first tombstone was within n elements of the end of the array, then the insert will be on the end of a contiguous array).
  3. If a deletion occurs, perhaps replace the element with a tombstone and then perform the same shift as insertion (it may end up shifting over itself, or an earlier tombstone if there had been one). Deleting an element earlier than the first tombstone replaces the index with the newly deleted index, thus eliminating the problem of knowing where later tombstones are (the tombstones will all eventually "run into each other" given enough inserts).

Also, two indices (start of first tombstone, and "shift horizon" after the first tombstone) would generally allow for shifting without scanning (or when tombstones collide into spans, thus the horizon offset points at a tombstone, the updating operation is only responsible for advancing the horizon by, say, 2n, presuming reads are cheaper than writes).

Further, since computers are good at memcpy operations, perhaps a index of last tombstone is better, as that allows the tail of array to tend to be contiguous, and larger spans of elements could be shifted "left" via memcpy.

Sorted dictionaries will often be convenient. We may be able to leverage tombstones to make these not too expensive, such as over-allocating array space to allow growth, and using tombstone gaps as an optimization for keeping elements ordered while allowing inputs to only require shifting a finite, or perhaps logarithmic, number of items that follow (or precede, depending), rather than all elements that follow.

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:05):

here's a simple idea: use IndexMap but don't expose removeShift directly, and have Dict be described in the docs as not guaranteeing that insertion order is preserved

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:05):

however, expose sufficient primitives that someone can build OrderedDict in userspace on top of it

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:06):

or perhaps it's possible to build a sufficiently fast one in pure Roc, without even using the builtin Dict!

view this post on Zulip Kevin Gillette (Mar 13 2022 at 01:15):

I am somewhat confused about the big picture. The discussion thus far seems to presuppose Roc dictionary-handling code that can be optimized into in-place mutation, and though we can anticipate that such optimizations can be regularly applied, given that Roc is a language without [observable] side effects, the language allows (and the programmer can even take advantage of) cases in which dictionaries may diverge: storing the value of a dictionary, both before an after an update, in two variables; or passing the dictionary to two separate functions that each make their own updates (which need to get "dropped" if the dictionary is not passed back); or, presumably since concurrent computation would be valuable for the types of computing loads that Roc is targeting, passing a partially initialized dictionary to concurrent tasks which each make their own modifications simultaneously.

Then there's the pathological case of non-tail recursion, where each call passes a modified copy (presume some mix of inserts, updates, and deletions) to the next invocation and then uses the received result alongside some iteration of the dictionary state it already had (thus making it very difficult to optimize this into a single dictionary with in-place mutations).

It does not seem like IndexMap, or indeed most other hashmap implementations, would be suitable for such cases. iirc, Elm uses something with similar properties to a b-tree, such that each update merely is some mask over a reference to the previous state. Would we use a similar technique in cases where optimization is not possible, or would the compiler just arrange to copy the dictionary when it detects that optimizations are no longer safe, or would it use layered dictionaries with multiple tags (Normal, Tombstone, MaskedDelete, MaskedUpdate)?

If the compiler can generate multiple dictionary implementations depending on the optimizations available, would the implementations be transparent and compatible (i.e. would dictionaries containing the same semantic contents be considered equal even if they have differing internal implementations/representations)?

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:28):

the discussion thus far seems to presuppose Roc dictionary-handling code that can be optimized into in-place mutation

that's basically the presupposition of the entire stdlib :big_smile:

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:29):

Roc's entire performance design is based on a hypothesis that this will be either true in practice, or will diverge rarely enough (e.g. costing one or two automatic clones here and there, and then each of those clones gets in-place optimized from then on because each clone is now unique) that this overall strategy will pay off, and be faster overall than persistent data structures would have been

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:29):

we can't know whether the hypothesis will hold true or not without trying it!

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:30):

nobody's ever tried this design before in a production language, as far as I know, so we're in uncharted territory and forging our own path

view this post on Zulip Richard Feldman (Mar 13 2022 at 01:33):

If the compiler can generate multiple dictionary implementations depending on the optimizations available, would the implementations be transparent and compatible (i.e. would dictionaries containing the same semantic contents be considered equal even if they have differing internal implementations/representations)?

I don't think this would ever come up. By definition, the compiler can only possibly generate different implementations based on knowledge it has at compile time, which in practice would mean monomorphized type information. In order for two things to be compared for equality, they have to have the same type - so if the compiler could possibly generate a different implementation, then they couldn't possibly be compared for equality!

view this post on Zulip Kevin Gillette (Mar 13 2022 at 02:35):

That makes sense, thanks!

view this post on Zulip Jared Cone (Mar 13 2022 at 04:52):

I'm in gamedev where perf is often important, but I would still prefer a Dict that preserves order by default at a slight perf cost. IMO it's easier to profile and fix perf issues than finding and fixing bugs caused by false assumptions.

view this post on Zulip Jared Cone (Mar 13 2022 at 04:55):

Regarding equality depending on ordering: I'm new to pure functional, but I would assume that if two values are equal then I could pass either into a pure function and get the same result for both. If ordering isn't considered for equality then that assumption doesn't hold.

view this post on Zulip Richard Feldman (Mar 16 2022 at 23:06):

:thinking: what would be a situation where using a dictionary that preserves insertion order is better than using an ordinary array? (Context: I'm wondering whether Dict.removeShift is actually a good idea or just a footgun because if you want that, you're better off using a List instead.)

does anyone have firsthand experience using a dictionary that preserves insertion order?

view this post on Zulip Brendan Hansknecht (Mar 16 2022 at 23:46):

If you still want O(1) lookup of a value based on a key? an ordinary array doesn't give you that.

But in general, I have never seen code that explicitly depends on insertion order of a dictionary or hashmap. Depends on having order based on the key or value, sure, but never seen insertion order mattering.

view this post on Zulip Brendan Hansknecht (Mar 16 2022 at 23:47):

I wonder if there is a way to look up all the crates that depend on indexmap in rust. They theoretically would have a reason.

view this post on Zulip Richard Feldman (Mar 16 2022 at 23:53):

yeah so far the use cases I've heard of that would apply here are "I'm transforming some configuration JSON or TOML and I want to write it back out in the same order to minimize git diffs"

view this post on Zulip Richard Feldman (Mar 16 2022 at 23:55):

Add it certainly seems reasonable to potentially want to do a transformation that removes an entry while preserving order

view this post on Zulip Richard Feldman (Mar 17 2022 at 00:43):

relatedly, another use case that came up (but which doesn't affect removeShift) is an LRU cache

view this post on Zulip Richard Feldman (Mar 17 2022 at 00:44):

we could support that use case really nicely if we just had a Dict.replaceAt that lets you replace whatever entry happens to be at the given index

view this post on Zulip Richard Feldman (Mar 17 2022 at 00:45):

so you'd give it a new key, value, and index, and then all you'd need to do to make an LRU cache would be to separately track the index and do modulo on the capacity of the dictionary so it wraps around when you get to the end

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:21):

that lets you replace whatever entry happens to be at the given index

Even tombstones?

view this post on Zulip Richard Feldman (Mar 17 2022 at 01:22):

there wouldn't be tombstones in the array, right?

view this post on Zulip Richard Feldman (Mar 17 2022 at 01:22):

we ended up settling on the IndexMap implementation that keeps it densely packed

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:23):

Ah, forgot.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:29):

I think it has a problem for LRU cache. How do you update an element to move it to the front of the line. You would need to swap it to the front of the list and shift all elements between it and the front of the list over by one.

view this post on Zulip Richard Feldman (Mar 17 2022 at 01:38):

oh good point :thinking:

view this post on Zulip Richard Feldman (Mar 17 2022 at 01:38):

yeah you'd need to do something like remove to have it be fast, swap it with something else

view this post on Zulip Richard Feldman (Mar 17 2022 at 01:38):

but then the other entry will get evicted sooner than it should

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:41):

Least Frequently Used is often preferable compared to Least Recently Used. With LFU, you're at most swapping two indices in the array, which is pretty cheap. Frequent items shift to the front of the list, rarely used shift towards the end and get pruned

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:42):

Doesn't LFU normally store with PQ or something of that nature so that it is sorted and you can find the least element?

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:43):

In that case, I don't think the insertion order dictionary would be that helpful in general, but maybe I am missing something.

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:43):

Yes, but if the Dict value contains the frequency info, and we can have arbitrary control over swapping items in the order array, you've got your priority queue

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:44):

hmm. I guess with replace at, whenever you increment the counter for an element, you could check the indices next to it and potentially swap it. Though feels weird to give the end user direct access over the underlying implementation detail of the Dictionary.

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:45):

yes, agreed, but since some kind of order is important to the functional notion of equality, then we have an opportunity to embrace order

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:46):

And an LFU cache would probably abstract away those details even when using, internally, a Dict in a low level way

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:47):

One disadvantage is that if swapping is expensive for values, you will be taking that performance hit. with a priority queue just on the keys and a regular hash map, that could be faster. Though you could always box the value if it is large...so yeah, sounds like an LFU would work well here.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 01:48):

instead of array of (keys, count) and a hashmap from key to value. It would be an ordered dict from key to (count, value)

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:49):

indeed

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:50):

and nicer because a removal from the array or map implicitly involves a removal from the other as well

view this post on Zulip Kevin Gillette (Mar 17 2022 at 01:51):

I get now why it's called an IndexMap

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:17):

if we had functions like these:

update : Dict k v, k, ([ Val v, NoVal ] -> v) -> Dict k v
updateAt : Dict k v, Nat, ([ Entry { k, v }, NoEntry ] -> { k, v }) -> Dict k v

...then maybe we could do an LRU cache too, by storing something like LruCache k v := Dict k { next : Nat, prev : Nat, val : v } which records the indices of the next and previous values

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:19):

Sure, but then the insertion order of the dict doesn't really matter. you are storing the related link list inline and that could be done with any dictionary.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:20):

I guess you would always want to swap the LRU item with then end of the dictionary to make it easy to find.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:20):

Some maybe it does have some extra merit here as well.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:23):

I think UpdateAt would also need Err OutOfBounds

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:23):

I was thinking NoEntry would cover that

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:23):

"there's no entry at that index"

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:24):

Yes, but then when they get a no entry, they give you a { k, v } to insert there.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 02:24):

So if it is way off the end of the list, you just grow?

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:24):

oh good point haha

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:25):

I could see an argument for just doing an insert if that happens

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:25):

it would be the fastest - just do the update at min(length, index)

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:26):

actually

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:27):

maybe a better design is:

update : Dict k v, k, (v -> v) -> Dict k v
updateAt : Dict k v, Nat, ({ k, v } -> { k, v }) -> Dict k v

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:28):

and if the key is not found, or the index is out of bounds, we return the original dict unmodified

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:35):

that's what e.g. List.set does on out of bounds, for example

view this post on Zulip Richard Feldman (Mar 17 2022 at 02:36):

could also have

updateOrInsert : Dict k v, k, ([ Entry v, NoEntry ] -> v) -> Dict k v

view this post on Zulip Kevin Gillette (Mar 17 2022 at 04:10):

So per that design, if you have:

emptyDict
  |> update "x" \_ -> 1
  |> update "y" \_ -> 2
  |> updateAt 0 \_ -> {k: "z", v: 3}

Would the resulting dictionary be equivalent to:

{ "z": 3, "y": 2}

In other words, would z completely replace x?
An explicit index swap operation would be quite convenient for those cases in which the keys and values should remain intact, but only the relative ordering should change.

view this post on Zulip Richard Feldman (Mar 17 2022 at 04:13):

if the design is to have them work this way:

Richard Feldman said:

and if the key is not found, or the index is out of bounds, we return the original dict unmodified

...then it would still be an empty dictionary at the end, which honestly seems less surprising to me :sweat_smile:

view this post on Zulip Richard Feldman (Mar 17 2022 at 04:14):

Kevin Gillette said:

An explicit index swap operation would be quite convenient for those cases in which the keys and values should remain intact, but only the relative ordering should change.

that does sound convenient! :thumbs_up:

view this post on Zulip Kevin Gillette (Mar 17 2022 at 04:16):

Richard Feldman said:

and if the key is not found, or the index is out of bounds, we return the original dict unmodified

Sure, there will be use-cases where best-effort is ideal. Pairing that with "checked" operations of some kind, will cover the cases in which the user needs to know whether the change succeeded, though I suppose that can also be done with index/key operations which can return a Result.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 04:24):

What does this give?

emptyDict
  |> insert "x" 1
  |> insert "y" 2
  |> insert "z" 3
  |> updateAt 0 (\_ -> {k: "z", v: 2})

{ "z": 2, "y": 2 }? cause giving { "z": 2, "y": 2, "z": 3 } would be broken.
The update at just feels really strange to me because it ignores the existing elements.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 04:27):

Also, if update at creates { "z": 2, "y": 2 }, it means that it has to remove the old z. In that case, I assume it would do a swap remove which would reorder part of the list? This might have some weird gotchas if you try to use it for an LRU or LFU cache.

view this post on Zulip Kevin Gillette (Mar 17 2022 at 05:49):

Indeed, so any updateAt could actually impact up to n elements in the array (and 2 [?] 3 [?] elements in the hashmap), such as with:

emptyDict
  |> insert "x" Before
  |> insert "y" Before
  |> insert "z" Before
  |> updateAt 1 (\_ -> {k: "x", v: After})

y would disappear, and x would be moved to the middle element (which would cause the first element, the former location of x, to disappear?), and then, without tombstones, z and any elements after it, would need to shift to fill the newly formed gap.

view this post on Zulip Brendan Hansknecht (Mar 17 2022 at 05:52):

z and any elements after it, would need to shift to fill the newly formed gap.

That or swap the last element into the hole since that is the default remove functionality?

view this post on Zulip Kevin Gillette (Mar 17 2022 at 05:53):

For an LFU cache, I'd just rely on single swaps, typically adjacent: as soon as you get a cache hit, you increment the count for that key, and if the increment puts it higher than its left-side (lower index) neighbor, swap the two positions. In the pathological case where there's a run in k keys that all have the same count and you increment the last one in the run, then just swap the first and last elements in that run (the relative ordering among elements with the same count shouldn't matter if they have the same count). Scanning to find the swap partner might thus take arbitrary time without some auxiliary data of some kind, but the actual swap would always be constant time.

view this post on Zulip Kevin Gillette (Mar 17 2022 at 05:55):

Brendan Hansknecht said:

z and any elements after it, would need to shift to fill the newly formed gap.

That or swap the last element into the hole since that is the default remove functionality?

In the general case, that could be fine, or not, depending on whether, and how, the application depends on ordering. For applications which don't care about Dict ordering, then any swap to fill a hole would be fine, assuming the swap partner is deterministically selected.

view this post on Zulip Giovanni Zanandrea (Mar 21 2022 at 12:16):

I wonder if it wouldn't be better to have a builtin dictionary that doesn't expose the ordering of the elements at all, and then build something similar to Rust's IndexMap on top of it in a library

The builtin version wouldn't have a toList method (or any other methods that exposes an internal ordering), but it could still have all order-independent methods like toSortedList, map, lookup, update and so on. And it wouldn't have the above mentioned problems with equality.

From this discussion it sounds like the version of the dictionary that remembers the insertion order would have to store the order in a separate array anyway, so implementing it on top of the order-independent builtin version should not create performance problems. And since it looks like there's more than one useful way to manage the order (removeSwap, removeShift, ...), maybe we could also have different versions of the ordered dictionary for different use cases.

Am I missing something?

view this post on Zulip Anton (Mar 21 2022 at 12:30):

Using toList on a dict comes up fairly often in data science so I imagine it would be missed.
In data science you may also have very large dicts, so you don't want to waste time sorting things when it's not needed.

view this post on Zulip Brendan Hansknecht (Mar 21 2022 at 14:51):

Also, not too uncommon to want to iterate over a Dict or Set.

view this post on Zulip Brendan Hansknecht (Mar 21 2022 at 14:52):

It would be problematic if that required building and sorting a list.

view this post on Zulip Giovanni Zanandrea (Mar 22 2022 at 07:11):

Sorry, I don't think I explained myself clearly here. I didn't mean to suggest that we should give up the ability to iterate through the elements of a dictionary without first doing some expensive sorting. The point I was trying to make is a different one. So please allow me to elaborate a bit. As an example I'll use a set rather than a dictionary: the issues are the same, but using sets simplifies the discussion a bit.

A set is, by definition, a collection that contains no duplicates and whose elements are not ordered. A "set" that remembers the order in which its elements were inserted is not a set anymore, it's something else. It's a list/queue without duplicates that also provides the ability to quickly locate the position of a given element. I personally feel very uncomfortable about having a set or dictionary type that implements methods like the above mentioned updateAt.

If I had to implement this weird data structure I would do something like this:

    template <typename T> class not_a_set {
      // Main data structure, the one that contains all the elements in their proper order
      vector<T> elements;

      // Auxiliary data structure, maps a hashcode to the
      // indexes of all elements that have that particular hashcode
      unordered_map<u32, set<u32>> hashcode_to_indexes;

    public:
      bool operator == (const not_a_set<T> &other) const {
        // For a comparison we can ignore hashcode_to_indexes,
        // elements contains all the information we need
        return elements == other.elements;
      }

      bool contains(const T &element) {
        return find(element, calculate_hashcode(element)) != 0xFFFFFFFF;
      }

      vector<T> &to_list() {
        return elements;
      }

      void insert(const T &element) {
        u32 hashcode = calculate_hashcode(element);
        u32 idx = find(element, hashcode);
        if (idx == 0xFFFFFFFF) {
          elements.push_back(element);
          hashcode_to_indexes[hashcode].insert(elements.size() - 1);
        }
        else {
          ...
        }
      }

      void remove_swap(const T &element) {
        ...
      }


    private:
      u32 find(const T &element, u32 hashcode) {
        set<u32> &indexes = hashcode_to_indexes[hashcode];
        for (auto it=indexes.begin() ; it != indexes.end() ; it++) {
          u32 idx = *it;
          if (element == elements[idx])
            return idx;
        }
        return 0xFFFFFFFF;
      }
    };

The main data structure here, the one that contains all the elements, is a simple list/vector/array. The other member variable (hashcode_to_indexes) is only an auxiliary data structure used to quickly locates the index of a given element.

Node that here I never need the ability to iterate through hashcode_to_indexes, because if I need to iterate through the collection, I'll just iterate through the elements vector.

Also, this is what I would do if all I had to implement was remove_swap(...). If I had to (efficiently) implement remove_shift(...) I would use a completely different (and more expensive) data structure.

So if I had to design a collection library I would start by implementing unordered collections like HashSet and HashDict, probably as builtin types, for efficiency. I wouldn't want their elements to have any externally observable order because the order of the elements is extra information we would have to keep track of, and that's bound to constraint the implementation and make it less efficient. And since there's no order, I would avoid any method (like for example toList) that exposes the internal ordering of the elements, because that would introduce nondeterminism in the language.

Then I would use those builtin types as building blocks to implement more capable library types like NotASetWithRemoveSwap, NotASetWithRemoveShift, NotADictWithRemoveSwap, NotADictWithRemoveShift and so on (I would choose nicer names, of course).

Then in my application code I would use HashSet and HashDict only when performance is super important and I don't need to iterate through the entire collection, and the derived library types in all the other cases.

view this post on Zulip Richard Feldman (Mar 22 2022 at 11:08):

that seems like a reasonable (and clever!) design to me except that one of the original goals was to come up with a "default dictionary" (and Set) that almost everyone in the ecosystem feels comfortable reaching for unless they have very unusual requirements

view this post on Zulip Richard Feldman (Mar 22 2022 at 11:10):

I'd expect iterating over the dictionary to be a common requirement, so if we adopted this design, I think we'd want to have both data structures included in the builtins

view this post on Zulip Richard Feldman (Mar 22 2022 at 11:11):

at which point I think the guidance would be "always use the ordered one because you can iterate over it"

view this post on Zulip Richard Feldman (Mar 22 2022 at 11:11):

but that makes the naming question more prominent

view this post on Zulip Richard Feldman (Mar 22 2022 at 11:13):

I understand your point about "if it's ordered, it no longer feels like a Set" - but if the advice is "when you want a Set, don't reach for the thing named Set because you can't iterate over that one" seems to be unavoidably at odds with the design goal of having an obvious default data structure to reach for! :sweat_smile:

view this post on Zulip Brendan Hansknecht (Mar 22 2022 at 15:29):

A bit extra to add on top of what Richard said:

Firstly, I want to make a clear distinction between mathematics and implementation. For example, any programming implementation of a set contains ordered elements. They are all implemented on top of some form of list. That is part of the linear memory space we on constrained to. With that in mind, I would say that the simplest implementation of a set, is just a list. As long as you scan the entire thing when you insert, you can make sure it will never have duplicates. That alone is enough to build the data structure we have describe above. It would be slow, but it would a minimal functional implementation that has no dependency on any underlying dictionary/set implementation.

Why does that matter? Once mathematical definition and implementation details are separated, an important property opens up. We get to decide how much of them implementations is useful beyond the core set functionality. Given we are working with constrained hardware, taking advantage of implementation knowledge is very important for performance. The reason for things like updateAt is not because we think that it is a reasonable method on mathematical sets. updateAt is part of our specific implementation that trivially enables multiple use cases, like an LRU or LFU cache. Without exposing this, those data structures would get built extremely differently in user space. Most likely in a more complex and slower form.

All of this said, I would also like to note, that I personally think that we should just say that referential transparency is only per individual complilation, expose Num.appSeed, and make our default dictionary an unordered form with iteration. At least currently, that is not the plan for Roc.

view this post on Zulip Giovanni Zanandrea (Mar 23 2022 at 09:45):

OK, gotcha, but I still feel that either the mathematical or the computer science community should come up with a short and catchy name for the kind of data structure we're talking about. We have the term set, for collections that have no order and no duplicates; bag/multiset for those that can have duplicates, but no order; sequence/list for those that can have duplicates and whose elements are arranged in a linear order; but nothing for those with linear order but no duplicates. A glaring omission that really bothers me.

About the implementation vs mathematical definition: I think there are very strong arguments for sticking to the mathematical definition whenever possible. The mathematical structures have properties that are usually lost when you deviate from them. These properties are for example used by query engines for declarative languages like SQL or Datalog to rewrite queries and find more efficient ways to execute them. I'm no expert on that, but I've read that even the fact that SQL relations can contain duplicates makes optimization more complicated. Another benefit of not committing to any particular implementation is that, yes, you'll loose the extra perks that specific implementation can provide, but on the other hand the compiler/runtime/DBMS will be free to pick whatever implementation is best suited for any given task. And that's exactly what I do with my own language, which makes heavy use of sets and relation. The compiler analyzes the entire code base to see how each individual relation is used, and then picks the most efficient implementation that provides the required functionality. Without that, there's no way I could be competitive in terms of speed with, say, Java. And when it's not possible (or just too hard) to pick the best implementation at compile time, it's done under the hood (with much less efficiency, unfortunately) at runtime (More on that below.)

I'm sympathetic to the idea of restricting referential transparency to individual compilations, and that's to some extent what I do in my own language, since I see no way there to provide deterministic iteration over unsorted collections without totally killing performance. And in my case the situation is much worse that it would ever be in ROC, not only because sets and relations are the primary way to encode data, but also because mine is an embeddable language: the compiler generates a sets of classes in Java, C# and (soon) C++, that are meant to be included in an existing codebase, and each of those code generators use a totally different notion of internal ordering. In spite of that, I've never had in practice to fix a bug caused by nondeterminism, and I think that's even less likely to happen in ROC (it might be more of a problem in imperative languages, I guess). It's still annoying though that the same program can produce different (but equally valid) outputs when compiled by different versions of the compiler. I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

On a separate, but related, subject: is the idea of adding tree-based sets and dictionary to ROC being considered? They can provide functionalities that are not available in their hash-based counterpart (min, max and i-th elements, range queries, ...) plus they have a meaningful and efficient iteration order, and I think they can greatly benefits from being implemented as builtins. Let me explain why.

In my own language sets and maps are a builtin type, just like, for example, lists, and their implementation is an implementation detail that doesn't affect the semantics of the language. When a set/map is created in a single operation, for example using map/filter or a literal, the language stores it using a sorted array, which is the most memory-efficient representation. But if afterwards an individual element/entry is added to, or removed from, the set/map, the runtime transparently switches (in O(1)) to a hybrid representation that combines binary trees and sorted arrays (basically, a binary tree whose leaves contain sorted arrays, not just single elements) which can support O(log(N)) insertions and deletions. If this binary tree later needs to be copied (something that could be pretty common in ROC, as far as I understand), then the copy is arranged in the more efficient sorted array representation.

Some of this can be implemented in user space, but not all of it. For example having to iterate through a very deep binary tree is a pretty good opportunity to very cheaply rearrange its content in the more efficient sorted array representation, but since iteration is not supposed to change the collection this has to be done under the hood.

I think this applies (to a lesser extent) to the NotASet/NotADict types as well. For example, one reasonable way to implement removeShift (which, I believe, is a much more logically sound default than removeSwap) is to leave a hole in the array and postpone the (expensive) shifting operation until it's actually needed. And an iteration over the collection is an excellent opportunity to do that at a minimal cost, but again, that has to be done under the hood. Same goes for having to make a copy of the collection.

One more silly though: if the difference in performance between HashSet/HashDict and NotASet/NotADict turns out to be significant, wouldn't it be possible for the compiler to automatically replace the latter with the former whenever it can detect that the ordering is not used? I realized that's difficult to do in a functional language, but at least in some simple cases that might be within the ability of a moderately sophisticated analyzer. It may turn out to be not much harder than detecting at compile time when a data structure has a single reference and can be safely updated in place, which I seem to understand you guys are already doing?

view this post on Zulip Richard Feldman (Mar 23 2022 at 12:59):

Giovanni Zanandrea said:

wouldn't it be possible for the compiler to automatically replace the latter with the former whenever it can detect that the ordering is not used?

the compiler definitely has all the information necessary to do that!

I think the question is whether the benefits would actually be worth it in practice. :big_smile:

view this post on Zulip Kevin Gillette (Mar 23 2022 at 13:17):

I'm not able to find a mention of representational ordering, or lack thereof, at least on the [Wikipedia page] (https://en.m.wikipedia.org/wiki/Finite_set), though I'm not surprised because abstract mathematics doesn't care about performance (statements don't need to be computable in finite time or finite memory), and so there's no difference between "iterate over this set which contains some integers..." and "enumerate over every real number, and for those which are present in this set..."
The page does mention a lot about order, but not _representational_ order (irrelevant at the abstract mathematics level), but rather as a way to define the finiteness of finite sets, and as a way to relate sets themselves relative to each other (defining subset and so on).

I'm not so sure that it's actually mandated by computer science definitions that there be no defined order (for either a set nor a mapping). Languages often do not define order so that they can change the implementation without breaking expectations, and also so that they can use less memory or utilize simpler internal structures. If ordering had no implication on internal complexity or overhead, then I suspect we'd see defined order as the default.

In a referentially transparent language, like Roc, it's critical for the same operations to produce the same answer, which implies that data structure ordering must at least be deterministic based on creation: at least within a single program run, constructing a set with b, then a, must iterate in the same order as any other set created with b, then a. It has been called out that, if we can provide consistency across program runs, and across binaries compiled within the same major version of Roc, then that'd be a lot less confusing.

We certainly can give special names to sets and types which have a defined order (in Rust, the nearest equivalent being discussed is IndexMap). But is the value in that to distinguish between multiple types within the same language, or to just clarify the semantics of the default type? My sense is that it's the latter.

I've worked on a 3rd party set implementation for Go, which happens to be based on sorted, deduplicated arrays
and merge algorithms (and it happens to be much faster to do intersection and such that way than is available via Go's builtin maps): I never thought of calling this anything other than just "Set". Usually most people wanting to print out a set would prefer to do it in sorted order (when the elements are sortable), so it saves them effort if they don't need to convert to a list and then sort.

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:36):

but on the other hand the compiler/runtime/DBMS will be free to pick whatever implementation is best suited for any given task

Personally I see that as the job of the human, but I tend to work on lowish level stuff. So very clear bias. I also am a strong believer in Hyrum's Law, so I think it would be exceedingly hard to do this on software at scale. Minute differences will lead to breaking changes. That or hiding the differences will lead to unacceptable performance losses.

The compiler analyzes the entire code base to see how each individual relation is used, and then picks the most efficient implementation that provides the required functionality. Without that, there's no way I could be competitive in terms of speed with, say, Java.

I don't understand this. Java picks a specific implementation. You could just pick the exact same implementation and be competitive with Java. I also would bet that there are some implementations that are all around better than the Java implementation.

but also because mine is an embeddable language: the compiler generates a sets of classes in Java, C# and (soon) C++, that are meant to be included in an existing codebase

I find it really interesting how Roc basically has the same constraint. It is just always embedded using object files and C abi.

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:39):

I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

Can't you do unordered comparisons of the output, or is it more complex then that?

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:41):

is the idea of adding tree-based sets and dictionary to ROC being considered? They can provide functionalities that are not available in their hash-based counterpart (min, max and i-th elements, range queries, ...) plus they have a meaningful and efficient iteration order, and I think they can greatly benefits from being implemented as builtins.

I don't think their iteration is efficient (lots of pointer chasing), but otherwise, totally agree. Maybe we need to reconsider having an ordered and unordered dictionary type base on thing like (min, max and i-th elements, range queries, ...).

That being said, I think it would be easier to build an efficient tree based set in roc than an efficient hash based set in roc.

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:44):

I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

In my own language sets and maps are a builtin type, just like, for example, lists, and their implementation is an implementation detail that doesn't affect the semantics of the language

Aren't these contradictory? If there are observable difference, it must effect the semantics?

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:45):

if this binary tree later needs to be copied (something that could be pretty common in ROC, as far as I understand), then the copy is arranged in the more efficient sorted array representation.

Roc does a lot of work to try and do in-place mutation, so hopefully not.

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:46):

For example having to iterate through a very deep binary tree is a pretty good opportunity to very cheaply rearrange its content in the more efficient sorted array representation, but since iteration is not supposed to change the collection this has to be done under the hood.

Yeah, should be doable in userspace, but would be really odd: Why does iterating over my set return a new set?

view this post on Zulip Brendan Hansknecht (Mar 23 2022 at 17:49):

For example, one reasonable way to implement removeShift (which, I believe, is a much more logically sound default than removeSwap) is to leave a hole in the array and postpone the (expensive) shifting operation until it's actually needed. And an iteration over the collection is an excellent opportunity to do that at a minimal cost, but again, that has to be done under the hood.

This would be really expensive in a common dictionary use case of simply adding, removing, and looking up many items. There would never be a good time to run the remove shift. The underlying memory allocation would just keep growing due to all of the empty slots in the underlying list. It is not possible to fill in the empty slots without making insertion an O(n) scan, and that would break our insertion ordering anyway.

view this post on Zulip Giovanni Zanandrea (Mar 24 2022 at 14:37):

Kevin Gillette said:

In a referentially transparent language, like Roc, it's critical for the same operations to produce the same answer, which implies that data structure ordering must at least be deterministic based on creation: at least within a single program run, constructing a set with b, then a, must iterate in the same order as any other set created with b, then a. It has been called out that, if we can provide consistency across program runs, and across binaries compiled within the same major version of Roc, then that'd be a lot less confusing.

As I see it, that's a somewhat watered-down version of referential transparency. Yes, program runs are guaranteed to be deterministic, but individual functions are not:

  s1 = emptySet |> insert 1 |> insert 2 |> insert 3
  s2 = emptySet |> insert 2 |> insert 3 |> insert 1
  r1 = f s1
  r2 = f s2

s1 and s2 are identical for any meaningful definition of set, but r1 and r2 need not be. Note that the problem completely disappears when using binary trees or ordered arrays. Don't get me wrong, I totally agree that even this looser definition of referential transparency is extremely useful in practice, and I'll be very happy to take it, but I still like the real version better

Kevin Gillette said:

I've worked on a 3rd party set implementation for Go, which happens to be based on sorted, deduplicated arrays and merge algorithms (and it happens to be much faster to do intersection and such that way than is available via Go's builtin maps): I never thought of calling this anything other than just "Set". Usually most people wanting to print out a set would prefer to do it in sorted order (when the elements are sortable), so it saves them effort if they don't need to convert to a list and then sort.

I'm curious. Do you have a reference to the algorithms that you're using?

view this post on Zulip Giovanni Zanandrea (Mar 24 2022 at 14:44):

Brendan Hansknecht said:

I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

In my own language sets and maps are a builtin type, just like, for example, lists, and their implementation is an implementation detail that doesn't affect the semantics of the language

Aren't these contradictory? If there are observable difference, it must effect the semantics?

No. There are observable differences between the different versions of the compiler, but for any given version you cannot tell how a collection is implemented.

Brendan Hansknecht said:

For example having to iterate through a very deep binary tree is a pretty good opportunity to very cheaply rearrange its content in the more efficient sorted array representation, but since iteration is not supposed to change the collection this has to be done under the hood.

Yeah, should be doable in userspace, but would be really odd: Why does iterating over my set return a new set?

It doesn't. That's exactly my point. It just rearranges the content of the existing set under the hood. That's why that cannot be implemented in userspace in a pure functional language. I think this can be explained best with a few lines of code:

      // Creating a new sequence/list. The list contains a mix of different data types
      // (strings, integers, floating point numbers, dates, times and symbols) so there
      // can be no meaningful notion of order between them
      seq = ("Hi!", 42, `2022-03-24`, 3.14159, :a_symbol, `2022-03-24 10:20:53`, ...);

      // Turning that sequence into a set, using a comprehension expression. The set
      // is created with a single operation, so it is represented as a sorted array
      set_1 = [x : x <- seq];

      // Turning again the sequence into a set, but starting with the empty set and
      // inserting elements into it one at a time. Here the compiler is forced to use
      // a less efficient tree-based representation
      set_2 = [];
      for x <- seq
        set_2 = insert(set_2, x);

      // In spite of the fact that set_1 and set_2 have different
      // internal representations they still compare equal
      assert set_1 == set_2;

      // Iterating through the first set, and storing its element in a sequence. This
      // is the kind of operation that exposes an implementation detail that ideally
      // should be kept hidden:
      seq_1 = ();
      for x <- set_1
        seq_1 = (seq_1 | x);

      // Doing the same with the other set
      seq_2 = ();
      for x <- set_2
        seq_2 = (seq_2 | x);

      // Despite the fact that set_1 and set_2 have different representations, their
      // iteration order is the same and the resulting sequences compare equal
      assert seq_1 == seq_2;

      // Now set_2 is not stored as a binary tree anymore. It's now using a sorted
      // array. During the previous iteration iteration its elements were internally
      // rearranged as a sorted array. Since we're already retrieving all elements
      // of the set, the rearrangement can be done with minimal overhead, but it
      // has to be done under the hood. If there had been other references to set_2
      // throughout the program, they would all now see the sorted array

      // Inserting a new element into set_1 (which uses a sorted array).
      // The result, set_3, is represented as a single node tree. The root
      // stores the new element (2.71828), the left branch points to the section
      // of set_1's array that contains the elements that come before the new
      // one, and the right branch points the other section of set_3's array
      // The whole insertion takes O(log(N))
      set_3 = insert(set_2, 2.71828);

      // These two print instructions now produce the exact same output, but
      // this output will be different with each version of the compiler
      print seq_1;
      print seq_2;

Another example of separation between logical definition and physical implementation is records. Records in my language are just dictionaries (which are in turn just a special case of binary relations) that map symbols to arbitrary values (that has several advantages that are explained here). But of course they need an efficient representation. So if you define in your code a type like this one:

      type Rect = rect(left: Int, bottom: Int, width: Nat, height: Nat);

the compiler will produce the following definition in the generated code:

      typedef struct {
        int64 left_L;
        int64 width_L;
        int64 bottom_L;
        int64 height_L;
      } OBJ_Rect;

and all values that belong to the type Rect will be stored using that optimized data structure. You can create them directly:

      rect = rect(left: 4, bottom: 1, width: 5, height: 3);

or by assembling its pieces in a generic fashion:

      tag = :rect;
      keys = (:left, :bottom, :width, :height);
      values = (4, 1, 5, 3);
      rect = tag([k -> values(i) : k @ i <- keys]);

or even by ridiculously convoluted means:

      left = 4; bottom = 1; width = 5; height = 3;
      str = "rect(left: " & print(left) & ", bottom: " & print(bottom) & ", width: " &
                print(width) & ", height: " & print(height) & ")";
      res = parse(str);
      rect = if succeeded(res) then result(res) else undefined;

but you'll end up with the same physical implementation (obviously, only the first way of creating it is efficient). If you hadn't defined the Rect type, the above snippets of code would still be valid, but the result would be stored as a generic (tagged) dictionary.

Brendan Hansknecht said:

I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

Can't you do unordered comparisons of the output, or is it more complex then that?

No, unfortunately that wouldn't work in general. Silly example:

      // The following snippet of code prints `42` when compiled with
      // the C++ code generator, and `2022-03-24` with the Java one
      set = ["Hi!", 42, `2022-03-24`, 3.14159, `2022-03-24 10:20:53`];
      for x <- set {
        print x;
        break;
      }

The C++ version at least guarantees the natural iteration order for things like numbers, dates and times without affecting performance, but when generating Java code I haven't even been able to do that.

Brendan Hansknecht said:

I don't think their iteration is efficient (lots of pointer chasing), but otherwise, totally agree. Maybe we need to reconsider having an ordered and unordered dictionary type base on thing like (min, max and i-th elements, range queries, ...).

I used the expression "tree-based" for simplicity, but what I really meant was an implementation (or a combination of implementations) that rely on ordering (as opposed to hashcodes) to organize the elements of the set/dictionary. Once you start using sorted arrays under the hood as much as possible, in the average case everything becomes a lot faster. And there are other implementations as well. Kevin Gillette in the previous post mentioned one that I'm not familiar with and which I'm actually curious about. And then there are implementations that only work with integer keys, but which are supposed to be very fast, like Judy arrays.

view this post on Zulip Giovanni Zanandrea (Mar 24 2022 at 14:48):

Brendan Hansknecht said:

The compiler analyzes the entire code base to see how each individual relation is used, and then picks the most efficient implementation that provides the required functionality. Without that, there's no way I could be competitive in terms of speed with, say, Java.

I don't understand this. Java picks a specific implementation. You could just pick the exact same implementation and be competitive with Java. I also would bet that there are some implementations that are all around better than the Java implementation.

I need a more complex example to explain this (sorry):

      type EmployeeId = employee_id(Nat);
      type CompanyId  = company_id(Nat);

      schema Workforce {
        employee(EmployeeId)
          first_name:     String,
          last_name:      String,
          date_of_birth:  Date,
          soc_sec_no:     SocSecNum [unique];

        company(CompanyId)
          name: String;

        works_for(EmployeeId, CompanyId)
          salary:     Money,
          start_date: Date;

        works_for(e, c) -> employee(e), company(c);
      }

The above snippet of code defines what I call a relational automaton. Think of it as a sort of class that contains an entire mini-database inside. It works a bit like the Elm architecture. It can receive messages, which are processed by an update function. Automata don't share state and everything is quasi-deterministic. The major differences with TEA are that the state of the automaton is encoded using relations (as opposed to functional data structures), and that the update function instead of returning the new state returns a set of insert/update/delete and other atomic instructions that are then applied to the existing state to produce the new one (details here)

Now, that's hidden behind a lot of syntactic sugar, but the state of the above automaton is encoded using two unary relations/sets (employee and company), six binary relations (first_name, last_name, date_of_birth, soc_sec_no, name, works_for) and two ternary relations (salary and start_date), plus a set of keys and foreign keys.

Now, the relational model is IMHO by far the best way we have to encode certain types of information, but if implemented naively it's also pretty slow. So here the compiler needs to analyse how each relation is used, and pick the most efficient representation that can do the job. Let's take for example works_for. Workforce is likely to have methods that look like this:

      // Given the id of an employee, returns the set of companies it works
      // for (an employee can have more than one job)
      [CompanyId]  employers(EmployeeId e) = [c : c <- works_for(e, ?)];

      // Given the id of a company, return the set of employees
      [EmployeeId] employees(CompanyId c)  = [e : e <- works_for(?, c)];

Once the compiler see the code above, it knows that works_for has to be navigated in both direction, so it needs a more capable and expensive data structure to implement it. Same goes for soc_sec_no, which needs to guarantee the unicity of SSNs. But all the others relations, assuming that you don't actually search on any of them, are implemented using data structures that are nearly identical to the ones used in the Entity-Component-System architecture, which is used in videogames exactly because it's very fast. That's what I mean when I say that I couldn't be competive with a language like Java if the compiler wasn't free to pick the best implementation for each individual case.

Brendan Hansknecht said:

but on the other hand the compiler/runtime/DBMS will be free to pick whatever implementation is best suited for any given task

Personally I see that as the job of the human, but I tend to work on lowish level stuff. So very clear bias. I also am a strong believer in Hyrum's Law, so I think it would be exceedingly hard to do this on software at scale. Minute differences will lead to breaking changes. That or hiding the differences will lead to unacceptable performance losses.

We'll have to agree to disagree on this one. I believe that the job of a high level language is to work out the implementation details for you as much as possible. Even when the compiler can't do that in an optimal way because it lacks information a human would have, it can be nudged in the right direction using declarative annotations, for example. Obviously, that approach won't alway work. But isn't that true of all high-level languages? ROC itself is probably not going to completely eliminate the need for languages like C/C++/Rust/Zig, but does that mean it's not useful? I think not, and I'm actually very excited about it, so much so that I hope to be able to contribute to it at some point in the future.

view this post on Zulip Brendan Hansknecht (Mar 24 2022 at 17:55):

There are observable differences between the different versions of the compiler, but for any given version you cannot tell how a collection is implemented.

So if someone switches from the Java version to the C++ version, they might not be able to directly port their Java code to C++ because they need to deal with differences between the Java and C++ version of the compiler?

It doesn't. That's exactly my point. It just rearranges the content of the existing set under the hood.

Sorry, with the re-arranging in user space, I meant that you could do it explicitly. I totally understand that you lose the implicit/under the hood transformations. So I don't think an explicit user space version would be at all reasonable. Also, the under the hood re-arranging (assuming no accidentally noticable side effects) is a really cool idea.

Your record example is very intriguing. How do you actually decide when to swap implementation and not use a struct?

which needs to guarantee the unicity of SSNs

Complete aside, I was an SWE intern at a hospital company and this was not true in their data....which was concerning to say the least. They probably should be using the contradictions to detect fraud.

That's what I mean when I say that I couldn't be competive with a language like Java if the compiler wasn't free to pick the best implementation for each individual case

Thanks for the clarification. I guess the core information here is that your language is far from implementation. So if someone were to make the same functionality in raw Java, they might go about it in a totally different way. So you language has to consider a lot around lowering to get competitive speeds.

We'll have to agree to disagree on this one. I believe that the job of a high level language is to work out the implementation details for you as much as possible.

Yeah, as I said, I tend to work pretty low level, so I have a lot of bias. I just generally have found being lower and having more control was worth it for the performance. I think a lot of this discussion depends on how high level Roc actually is. I want to use Roc as a low level langauge. I want to be able to use it were I would currently reach for Rust or C++. With that in mind, I don't want to heavily depend on compiler heuristics for performance or pay the cost of "under the hood optimizations" that I don't actually need. So I understand both sides, but my want for "high level" is much much lower than your form of "high level".

On the other hand, with something high level like python or javascript or SQL, I would gladly take your optimizations as long as they aren't too slow to compile/interpret.

ROC itself is probably not going to completely eliminate the need for languages like C/C++/Rust/Zig

Roc depends on C/C++/Rust/Zig. You can't compile a Roc app without a platform, and a platform has to be written in a language like those. As I allude to earlier, my hope would just be that roc is essentially as fast and only slightly higher level than those languages, but while being a pure functional programming language. That will be had to do at scale, but I think the project will manage this for the most part.

view this post on Zulip Kevin Gillette (Mar 26 2022 at 04:11):

Giovanni Zanandrea said:

I'm curious. Do you have a reference to the algorithms that you're using?

Hah, I wish. I haven't found anything in the accessible literature. I've pretty much using what is described in https://en.wikipedia.org/wiki/Merge_algorithm (iow, "MergeSort"), except that the inputs are sets (lists) that are _already_ sorted (and are deduplicated), and the merge details vary based on the set operation, for example set intersection skips elements from either list that are not present in both lists; and none of the operations produce duplicates as a merge sort would. Also, there are opportunities to use binary search for long non-overlapping sequences, and some of the techniques found in Timsort.

The main downside seems to be that single-element membership tests are logarithmic time rather than amortized constant time as you'd get with a hashmap, though multiple-element membership tests ("are all present" via subset checking, and "are any present" via intersection checking) does pretty well, since some elements can be easily excluded by comparing min/max set values, and the inherent sorted order can provide some optimizations that just aren't available to a hashmap.

iiuc, with sets built on pure hashmaps using hash functions with good distribution, the "amortized constant time" factor can add up to a significant/notable amount of overhead when performing pair-of-sets operations (like union, intersection, etc): if the two sets have identical hashing functions, then at least the hash cost would not need to be paid more than once per element; if the two sets also have an identical number of buckets (same approximate size/capacity), then hashing doesn't need to be performed, since buckets can just be walked pair-wise; but in any case, assuming each bucket has a list of 8-16 ("k") elements that are in arbitrary order, that means a k-element walk in set b for every element in set a for an operation (if searching for a non-match, such as set difference, otherwise average k/2 if searching for a match that happens to exist, such as in set intersection). If a hash is needed, and the elements are long strings (assuming there's no auto-interning of strings), then hash can be arbitrarily expensive: sorted-list sets don't have this issue, nor the k-element sub-walk overhead.

view this post on Zulip Kevin Gillette (Mar 26 2022 at 04:56):

Giovanni Zanandrea said:

I believe that the job of a high level language is to work out the implementation details for you as much as possible.

I generally agree with this. C++ is a language that has ~80 keywords, many/most of which are defunct or have been repurposed because C++' goal was to provide performance _now_ even when we could be confident that smarter compilers would obsolete the value of that low level control later. I'm not convinced C++' approach is in any way sound for a modern language: if we believe something _can_ be optimized later, maybe we can provide an escape hatch of some kind, but the core language or libraries should probably not express distinctions which are relevant to performance unless such details also better clarify intent ("I'm using this kind of mapping because it has these powerful _extra_ capabilities I need," rather than "I'm using this kind of mapping because, while it has an identical API to all the others, it has better performance in my specific use case").

Having one (or very few) standard contract for a hashmap, list, etc, does mean that it can more flexibly be optimized in various and transparent ways, such as @Giovanni Zanandrea is describing. Also, having a single simple implementation is also a form of optimization considering it has great instruction cache utilization compared to many compiled datastructure variants. The point is that, at least for the kind of work I do (I'm certain it varies by person and hence is a great discussion), I need _pretty good_ performance and _pretty reasonable_ overhead, but once that's obtained, prefer simplicity above optimal performance or other concerns.

Perhaps there's some middle ground where the language provides a default Dict type with a variety of capabilities, which compiles into one or another Dict implementation residing within IHopeYouKnowWhatYourDoing.Dict submodules depending on which capabilities are being used (if you never iterate over the Dict, then a lower-overhead order-undefined classic hashmap can be selected, for example). These specialized Dict implementations could be used directly by those who need direct control.

However, in my experience, these specialized implementations should probably not be named WyHash or FnvHash or whatever, but rather named after their general tradeoffs, and only document that "right now this uses Fnv." Remember that md5 was considered pretty good and general purpose at one point (albeit an entirely different kind of hashing function). More to the point, various bespoke or temporarily well-regarded dictionary hashing algorithms, that turn out to have distribution problems or pathological cases, have also come and gone over the years; or if they were actually pretty good, then regularly enough something that is strictly better at achieving the same tradeoffs comes along (same memory but faster on all the same inputs, or same speed but lower memory, or same speed and memory but better distribution).

Naming implementations after their low level implementation details (rather than high-level capabilities or tradeoffs) may lead to a situation in which we really regret adding a type to the stdlib, and really want people to stop using it, and at that point we're stuck if we care about retaining compatibility for existing code.

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 05:14):

I'm not convinced C++' approach is in any way sound for a modern language

I totally agree. I use C++ because I have to and it matters for a lot of my work (rust would also be mostly valid). I personally would love something akin to zig++ where it is C++ but with a clean start that is willing to delete and simplify. Zig is great, but I am not a big fan of C, so I don't really want the best version of C.

That being said, I don't really agree with these two statements:

even when we could be confident that smarter compilers would obsolete the value of that low level control later
but the core language or libraries should probably not express distinctions which are relevant to performance unless such details also better clarify intent

Though compilers are getting better and smarter, there are many cases where either it costs to much to run an optimization (try n types of HashMap and pick the best) or the optimizations are outside the scope of the compiler (you wrote your logic like x, let me fundamentally understand that and write it with a new algorithm).

Due to these kinds of limitations, there will always be a reason to care about the low level details no matter how good the compiler gets. Using extra cores or memory when scaled to data centers has huge monetary and environmental costs. So saving 0.1% of cpu usage might actually really matter for some operations.

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 05:18):

I need _pretty good_ performance and _pretty reasonable_ overhead, but once that's obtained, prefer simplicity above optimal performance or other concerns.

I think for probably 95+% of code, this is the case. I think it is always good to keep that in mind especially since Roc is linked with a platform and the platform can do low level things if need (though it has more overhead than calling a builtin).
Also, in general, I am worried that analyzing every list or map or etc to potentially swap out the implementation will have very high compile time costs that we likely don't want.

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 05:19):

However, in my experience, these specialized implementations should probably not be named WyHash or FnvHash or whatever, but rather named after their general tradeoffs, and only document that "right now this uses Fnv."

100% agree. Unless we want to keep it around forever we need to be careful of these things. Especially since we might even want to change out the hash depending on the type.

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 05:20):

Roc does not want to be a kitchen sink like C++

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 05:21):

So we should make sure that whatever we do, we are committing to behaviour, not implementation.

view this post on Zulip Giovanni Zanandrea (Mar 26 2022 at 11:21):

Kevin Gillette said:

However, in my experience, these specialized implementations should probably not be named WyHash or FnvHash or whatever, but rather named after their general tradeoffs, and only document that "right now this uses Fnv." Remember that md5 was considered pretty good and general purpose at one point (albeit an entirely different kind of hashing function). More to the point, various bespoke or temporarily well-regarded dictionary hashing algorithms, that turn out to have distribution problems or pathological cases, have also come and gone over the years; or if they were actually pretty good, then regularly enough something that is strictly better at achieving the same tradeoffs comes along (same memory but faster on all the same inputs, or same speed but lower memory, or same speed and memory but better distribution).

But at that point, wouldn't it be preferrable to specify the hashing algorithm as a compiler option? We could have a configuration file where we declaratively control all the things that can introduce nondeterminism in the language. Not just hashing algorithms, but also things like Num.appSeed, which Brendan Hansknecht mentioned in another post. The developer would still have a pretty good degree of control over these things, and we wouldn't have to worry about backward compatibility because newer versions of the compiler could keep supporting the old configuration options while adding new ones.

view this post on Zulip Giovanni Zanandrea (Mar 26 2022 at 11:22):

Brendan Hansknecht said:

There are observable differences between the different versions of the compiler, but for any given version you cannot tell how a collection is implemented.

So if someone switches from the Java version to the C++ version, they might not be able to directly port their Java code to C++ because they need to deal with differences between the Java and C++ version of the compiler?

Correct. I think it would be trivial to maintain full compatibility between the C/C++ code generator and a (hypothetical) one that generates Zip or (unsafe) Rust code without affecting performance in any way, because all those languages give you full control over what the machine is doing. But compiling to a completely alien data model like Java's creates a lot of additional difficulties, and something has to give.

But as I said in a previous post, I've never had a bug that showed up in one of the implementations but not the others (and the compiler is self-hosted, so I've written plenty of code in my own language). The real issue is divergent but still valid outputs, which is a testability nightmare.

view this post on Zulip Giovanni Zanandrea (Mar 26 2022 at 11:23):

Brendan Hansknecht said:

Your record example is very intriguing. How do you actually decide when to swap implementation and not use a struct?

Right now I just optimize all tagged records that appear in a type definition. The type system I use can't do type inference anyway, and beside I like to explicitly define my data structures, so that is good enough for me. But you could also use other criteria as well.

view this post on Zulip Giovanni Zanandrea (Mar 26 2022 at 11:24):

Brendan Hansknecht said:

We'll have to agree to disagree on this one. I believe that the job of a high level language is to work out the implementation details for you as much as possible.

Yeah, as I said, I tend to work pretty low level, so I have a lot of bias. I just generally have found being lower and having more control was worth it for the performance. I think a lot of this discussion depends on how high level Roc actually is. I want to use Roc as a low level langauge. I want to be able to use it were I would currently reach for Rust or C++. With that in mind, I don't want to heavily depend on compiler heuristics for performance or pay the cost of "under the hood optimizations" that I don't actually need. So I understand both sides, but my want for "high level" is much much lower than your form of "high level".

On the other hand, with something high level like python or javascript or SQL, I would gladly take your optimizations as long as they aren't too slow to compile/interpret.

I'd like to point out that even when compiling to Java (and as I said, Java is a dreadful compilation target in a case like mine) my language is already competitive with OO Java and often faster than OO C# for data/state-intensive application (some benchmarks here). With a few additional optimization, which are not too hard to implement, I think the C++ code generator will be able to clearly outperform Java in that domain. So it's nowhere in the same league as python or ruby, let alone SQL, when it comes to performance.

Obviously, there are algorithms that are hard to implement efficiently in a referentially transparent language (even one like mine that allows some local mutability), so Java will have an edge there. But there might be a way to combine the relational model with a ROC-like language that does opportunistic mutation, and solve even that problem.

And the relational model offers opportunities for optimizations that I haven't even began to take advantage of. They would require the addition of a declarative Datalog-like language, which at the moment I don't have. But with that in place, I can think of use cases where it would be possible to provide a level of performance that would be very difficult in practice to achieve even in C. Not impossible of course, since C is basically glorified machine language. But it would take an awful lot of work to replicate what in a declarative language could be done automatically by the compiler.

Compilation speed is indeed a bit of an issue right now, but the main bottleneck is the structural type system. Optimization of data structures is actually pretty fast.

view this post on Zulip Giovanni Zanandrea (Mar 26 2022 at 11:25):

Brendan Hansknecht said:

ROC itself is probably not going to completely eliminate the need for languages like C/C++/Rust/Zig

Roc depends on C/C++/Rust/Zig. You can't compile a Roc app without a platform, and a platform has to be written in a language like those. As I allude to earlier, my hope would just be that roc is essentially as fast and only slightly higher level than those languages, but while being a pure functional programming language. That will be had to do at scale, but I think the project will manage this for the most part.

If this "optimization by opportunistic mutation" thing pans out I suspect it could be a real game changer for functional programming. I really hope it works.

view this post on Zulip Brendan Hansknecht (Mar 26 2022 at 14:38):

So it's nowhere in the same league as python or ruby, let alone SQL, when it comes to performance.

Sorry if my comment came off as attacking the speed of your language. I was trying to say that the higher level a language is the more ok I am with it swapping things out under the hood. As you move up the stack you are giving up control for convenience. Your language looks super convenient for it's use case, but the end user has very little control over final assembly executed.

view this post on Zulip Giovanni Zanandrea (Mar 27 2022 at 08:01):

Brendan Hansknecht said:

Sorry if my comment came off as attacking the speed of your language.

No worries, your comment did not come off as an attack and I didn't take it as such. I just wanted to point out that high-levelness and good performance are perfectly compatible


Last updated: Jun 16 2026 at 16:19 UTC