Hash Maps & Sets · ideas · Zulip Chat Archive

Honestly, I'm excited for abilities to be implemented in the compiler so that I can start testing out building types using abilities to see what kind of performance we get

Brendan Hansknecht (Mar 07 2022 at 20:52):

Brendan Hansknecht (Mar 07 2022 at 20:53):

Brendan Hansknecht (Mar 07 2022 at 20:54):

So naming and generic behavior, but today you could just write n versions of the function for different types.

Derek Gustafson (Mar 07 2022 at 20:54):

Having used mutable vectors in Haskell, I was really impressed with quicksort in Roc, and want to see what other pathological algorithms can be implemented without sacrificing performance

Brendan Hansknecht (Mar 07 2022 at 20:55):

Derek Gustafson (Mar 07 2022 at 20:55):

Derek Gustafson (Mar 07 2022 at 20:56):

Brendan Hansknecht (Mar 07 2022 at 20:58):

Ah, sure, and that is for the generic nature of it. Theoretically someone today could write DictIntStr, but it would be specific instead of generic. If written in roc, it would be almost identical to the final Dict a b.

Derek Gustafson (Mar 07 2022 at 21:00):

Brendan Hansknecht (Mar 07 2022 at 21:01):

Ayaz Hafiz (Mar 07 2022 at 21:02):

Derek Gustafson (Mar 07 2022 at 21:02):

Brendan Hansknecht (Mar 07 2022 at 21:02):

Derek Gustafson (Mar 07 2022 at 21:03):

Brendan Hansknecht (Mar 07 2022 at 21:05):

To be useful, that is probably true, though I don't think it is technically required.

Derek Gustafson (Mar 07 2022 at 21:06):

Brendan Hansknecht (Mar 07 2022 at 21:07):

Also, since we just have a generic supports Hashing, does that mean that each type essentially would have to implement it's own hashing algorithm? Like could one type implement a different hashing algorithm than another and would that cause problems?

Brendan Hansknecht (Mar 07 2022 at 21:08):

I guess in reality most users would just call the hash method for builtin types plus a hash combining method, so they would all use the same underlying hash algorithm in the end.

Derek Gustafson (Mar 07 2022 at 21:09):

My gut reaction is that you wouldn't compare the hashes from two different types.

Brendan Hansknecht (Mar 07 2022 at 21:09):

Ayaz Hafiz (Mar 07 2022 at 21:09):

also worth noting that the output of the hash would be an opaque 64 bits, and that you could only define custom hashes for opaque types. so the hash algorithm is less of a concern since it’s opaque in two ways

Richard Feldman (Mar 07 2022 at 21:10):

I'd be very interested in seeing what those could look like in user space - @Derek Gustafson is that something you're interested in collaborating on?

Derek Gustafson (Mar 07 2022 at 21:11):

Derek Gustafson (Mar 07 2022 at 21:12):

Derek Gustafson (Mar 07 2022 at 21:13):

Brendan Hansknecht (Mar 07 2022 at 21:18):

I would be up helping to make a user space version. We could either wait for abilities or implement something before abilities are in by hardcoding the input hashable type.

Derek Gustafson (Mar 07 2022 at 21:22):

You folks have made me even more excited for this. If no one's started it by then, I'll probably give it a go tomorrow night

Ayaz Hafiz (Mar 07 2022 at 21:23):

realistically we are at least 4-5 weeks away from being able to hash builtin types via abilities. i hope to get custom type abilities in sooner (at which point i think we could massively parallelize work in migrating the standard library to abilities, which would be cool if folks are in). but im giving a very aggressive timeline and it shouldn’t really impact the implementation of dict/set other than at the api surface

Brendan Hansknecht (Mar 07 2022 at 21:24):

Brendan Hansknecht (Mar 07 2022 at 21:25):

We can just start with ints as the keys and xoring as the hash. That should be enough for writing most of the core of an actual hash map.

Richard Feldman (Mar 07 2022 at 21:37):

yeah I think you could do all the time-consuming parts without abilities actually being implemented yet

Richard Feldman (Mar 07 2022 at 21:37):

like make an IntDict, StrDict, etc. and write the hash implementations for each of them

Richard Feldman (Mar 07 2022 at 21:37):

Richard Feldman (Mar 07 2022 at 21:38):

Folkert de Vries (Mar 07 2022 at 21:43):

those are just red-black trees. Our examples in /benchmarks have some code based on Lean benchmarks testing various RBTree functions

Folkert de Vries (Mar 07 2022 at 21:43):

Richard Feldman (Mar 07 2022 at 21:46):

oh sorry I meant from an API perspective, not from an implementation perspective :big_smile:

Richard Feldman (Mar 07 2022 at 21:46):

Richard Feldman (Mar 07 2022 at 21:57):

one interesting question to begin with is: what should the API be for making a dictionary from a known list of key/value pairs?

Richard Feldman (Mar 07 2022 at 21:58):

Dict.empty
    |> Dict.insert "foo" foo
    |> Dict.insert "bar" bar
    |> Dict.insert "baz" baz

Richard Feldman (Mar 07 2022 at 21:58):

Dict.fromList [
    { k: "foo", v: foo },
    { k: "bar", v: bar },
    { k: "baz", v: baz },
]

Richard Feldman (Mar 07 2022 at 21:59):

although by default the performance would be worse in the latter case because it would have to allocate an intermediate list and then throw it away

Richard Feldman (Mar 07 2022 at 21:59):

although there are interesting theoretical potential optimizations the compiler could do someday

Richard Feldman (Mar 07 2022 at 21:59):

e.g. inlining plus "if all you ever do with a list is fold over it, then unroll the loop and just don't bother creating the list at all"

Richard Feldman (Mar 07 2022 at 21:59):

Richard Feldman (Mar 07 2022 at 22:00):

Ayaz Hafiz (Mar 07 2022 at 22:02):

yeah that seems like a prime opportunity for a deforestation optimization to shine

Richard Feldman (Mar 07 2022 at 22:21):

Ayaz Hafiz (Mar 07 2022 at 22:26):

Folkert de Vries (Mar 07 2022 at 22:39):

current mono IR makes that kinda hard, my idea for an improved mono IR makes it more feasible

Folkert de Vries (Mar 07 2022 at 22:40):

it allows you to walk up the tree of statements again, currently we can only go down

Folkert de Vries (Mar 07 2022 at 22:40):

Brendan Hansknecht (Mar 07 2022 at 23:02):

So I just realized something. I think that a hashmap needs to be refcounted like lists, otherwise it will do tons of copying and have terrible performance. So would that mean it needs to be a builtin?

Brendan Hansknecht (Mar 07 2022 at 23:02):

Oh wait....nvm. the underlying implementation of the hashmap will be a list, so it will have the refcounting.

Brendan Hansknecht (Mar 07 2022 at 23:03):

Ayaz Hafiz (Mar 07 2022 at 23:08):

Brendan Hansknecht (Mar 07 2022 at 23:10):

Will be interesting to see which work well with roc in general. Though also, there is the complexity of performance depending on the size of the type in the hashmap and other details like that.

Richard Feldman (Mar 07 2022 at 23:18):

Richard Feldman (Mar 07 2022 at 23:19):

for most of my career I've always reached for whatever dictionary/set data structure was most commonly used

Richard Feldman (Mar 07 2022 at 23:19):

but working on the Roc compiler I've started to see the performance drawbacks of doing that :sweat_smile:

Richard Feldman (Mar 07 2022 at 23:20):

so I kinda like the idea of not having a default dictionary that's encouraged to be the thing everyone uses

Richard Feldman (Mar 07 2022 at 23:20):

Brendan Hansknecht (Mar 08 2022 at 00:11):

I agree except that I think it should be a recommend default with explicit guidelines around when not to use it.

I think that many users don't have the expertise to pick the correct map or necessarily want to dig into it.

Brendan Hansknecht (Mar 08 2022 at 00:17):

Expertise may not be the best wording. Many programmers could follow rules of thumbs to pick the right map, but that doesn't mean they really understand the choice or exact tradeoffs.

Jared Cone (Mar 08 2022 at 00:32):

Recommended default will also increase compatibility between libraries. Having a standard makes it easier to pass data around without having to do conversions.

Richard Feldman (Mar 08 2022 at 01:31):

true, although it worries me a bit if the standard is something other than List

Richard Feldman (Mar 08 2022 at 01:31):

Richard Feldman (Mar 08 2022 at 01:32):

Richard Feldman (Mar 08 2022 at 01:58):

Brendan Hansknecht (Mar 08 2022 at 02:46):

Richard Feldman (Mar 08 2022 at 03:16):

Richard Feldman (Mar 08 2022 at 03:17):

another is Dict Str Header (where Header perhaps does not include the name, and that's just the key in the dict)

Richard Feldman (Mar 08 2022 at 03:17):

a third answer is Headers which has a Headers.get : Str -> Result Header [ HeaderNotFound ]*

Richard Feldman (Mar 08 2022 at 03:17):

it's reasonably likely that of the three designs, the Dict one would have the worst performance

Richard Feldman (Mar 08 2022 at 03:18):

the Headers one lets you abstract away the internal representation, which might be the best

Richard Feldman (Mar 08 2022 at 03:19):

Richard Feldman (Mar 08 2022 at 03:20):

there's no real need for (say) a Dict being used in a CLI to include hash flooding mitigations, because DoSing a CLI isn't really a thing :big_smile:

Richard Feldman (Mar 08 2022 at 03:20):

so this would be an example where I think maybe having a default Dict everyone is accustomed to reaching for might lead to a worse outcome

Brendan Hansknecht (Mar 08 2022 at 03:55):

So, if we don't have a default, I would expect the average roc user to search the package registry for Map, Dict, or HashMap, and then just take the first package that has the most use overall. I don't think most users that want a Dict tend to think about the tradeoffs. They simply want a well implemented key value store.

I think adding a default helps to defend against people picking the wrong dictionary. If we provide a well documented default that has explicit documentation around when it won't work as well, that will help users switch to other dictionaries in those cases. So the documentation would look something like this:

Then, when a user goes to the default Dict, they will learn about hash flooding and have an alternatives to consider. Without this guidance, many users are even more likely to pick the wrong map. I think that a default with guidance is much more useful than no default at all.

Brendan Hansknecht (Mar 08 2022 at 03:56):

Brendan Hansknecht (Mar 08 2022 at 03:58):

As an extra note, this makes me really question the supports Hashing ability. The issue is that for many specialized Dicts, you will want to use a specific hashing algorithm. So unless support Hashing takes a list of primitive hashing functions or something of that nature, you won't be able to swap out the algorithm.

Richard Feldman (Mar 08 2022 at 03:59):

Brendan Hansknecht (Mar 08 2022 at 03:59):

Also, If someone makes an HTTP library, I would hope that they can just expose Headers.get as you mentioned. Then they can pick whatever Dict or not that they see as suitable. End users would never have to think about it.

Richard Feldman (Mar 08 2022 at 04:00):

it just provides a way to specify "given a hashing function, here's a way to incorporate all my data into it"

Richard Feldman (Mar 08 2022 at 04:01):

yeah, so maybe the answer is to have a roc/dict package that's a reasonable default, but also provide guidance to platform authors not to use it for HTTP headers :big_smile:

Brendan Hansknecht (Mar 08 2022 at 04:02):

Could a dict be parameterized on the hashing algorithm? I think so. If that is the case, then it may be possible to make a resistant dict pretty easily as an expansion on top of the standard dict. Though the paramterization may not be zero cost like in rust, so that may be problematic.

Richard Feldman (Mar 08 2022 at 04:05):

oh I think defending against hash flooding should be done at the hash map implementation level, not the hashing algorithm level

Richard Feldman (Mar 08 2022 at 04:06):

Brendan Hansknecht (Mar 08 2022 at 04:07):

I guess it depends the level of defense you want/need. In that case, it could still be a different package. I would expect at least 4 different Dict implementations that are relatively common. linear list based key value parse, some sort of flat hash map, some sort of node hash map, and some sort of flooding resistant hash map.

Richard Feldman (Mar 08 2022 at 04:08):

also, the only flooding-resistant hashing algorithms I know of require side effects for randomness

Brendan Hansknecht (Mar 08 2022 at 04:08):

Brendan Hansknecht (Mar 08 2022 at 04:09):

Richard Feldman (Mar 08 2022 at 04:09):

Richard Feldman (Mar 08 2022 at 04:10):

I guess you could hardcode it as part of the ability implementation and make sure to keep it secret from attackers :thinking:

Brendan Hansknecht (Mar 08 2022 at 04:10):

For sure, but if it is a user space library, it can either wrap a regular dict and store that, or it can make its own version and store that. Rather small cost.

Brendan Hansknecht (Mar 08 2022 at 04:10):

Richard Feldman (Mar 08 2022 at 04:11):

fair enough, but it's both more ergonomic and more performant to have the hash map itself implement the flooding mitigations :big_smile:

Richard Feldman (Mar 08 2022 at 04:12):

I am perhaps unreasonably excited about that idea because it seems like a very clear-cut case where every mainstream language is doing something suboptimal in multiple ways, and I like the idea of doing it better and advancing the state of the art :smiley:

Brendan Hansknecht (Mar 08 2022 at 04:15):

Even most c++ standard libraries do this horribly. Default int hashing algorithm of do nothing.

Richard Feldman (Mar 08 2022 at 04:16):

Brendan Hansknecht (Mar 08 2022 at 04:17):

Also, I am definitely concerned that we won't be able to implement a max performance Dict in roc itself. Most of them at some point or another depend on specific assembly instructions for bit twiddling and performance.

Brendan Hansknecht (Mar 08 2022 at 04:18):

But still I think this will be a fun project and interesting to see how far it can be pushed in Roc itself.

Richard Feldman (Mar 08 2022 at 04:20):

Brendan Hansknecht (Mar 08 2022 at 04:20):

In which sense, Roc implementation performance will be good enough or that we can fallback to other things if it is too slow.

Richard Feldman (Mar 08 2022 at 04:21):

I suspect at the point where an absolute max performance dict is really a must-have, using a language like Roc at all is probably off the table anyway

Brendan Hansknecht (Mar 08 2022 at 04:22):

Richard Feldman (Mar 08 2022 at 04:22):

and I suspect we can be competitive with other languages with automatic memory management like Go, Java, C#

Richard Feldman (Mar 08 2022 at 04:22):

Brendan Hansknecht (Mar 08 2022 at 04:22):

Richard Feldman (Mar 08 2022 at 04:22):

I'm definitely open to the idea that maybe we should have a Dict as a builtin for performance reasons, but I suspect it won't be necessary in practice :big_smile:

Richard Feldman (Mar 08 2022 at 04:23):

and one big reason I like not having it as a builtin is that it means we don't have to do a language-level breaking change if we ever want to change the hashing algorithm

Richard Feldman (Mar 08 2022 at 04:25):

because if there's one thing in programming that is an absolute unwavering certainty, it's that wherever there are hash maps, there are programs whose correctness unintentionally depends on the ordering that happens to be generated by their particular hashing functions

Richard Feldman (Mar 08 2022 at 04:26):

also people keep coming out with new hashing functions that are faster than the old ones, so there's motivation to upgrade too :big_smile:

Brendan Hansknecht (Mar 08 2022 at 04:27):

Might be a good idea to always take a random seed so that iteration can have random order so that users can't depend on it.

Brendan Hansknecht (Mar 08 2022 at 04:27):

Richard Feldman (Mar 08 2022 at 04:28):

Richard Feldman (Mar 08 2022 at 04:29):

but yeah, I think it's much more okay for a package to say "here's a new version of this collection, it uses a new hashing algorithm; update to it only when you're ready" than for a language to do it

Richard Feldman (Mar 08 2022 at 04:30):

because in the package case, you can always continue using the old version in any places that break with the new algorithm, and fix them (or not) at your leisure

Richard Feldman (Mar 08 2022 at 04:30):

whereas at the language level, you're blocked on using the new language release until you can address the bugs from the changes, which is a much bigger concern!

Richard Feldman (Mar 08 2022 at 04:31):

that in turn creates an incentive for the language never to change its builtin hashing function, whereas packages can do it without that problem

Brendan Hansknecht (Mar 08 2022 at 04:33):

Derek Gustafson (Mar 08 2022 at 11:19):

I believe the reason Python chose the identity for it's integer hash function is that the common use case for int keys is using a dict as an array, and they get better average performance that way

Derek Gustafson (Mar 08 2022 at 11:36):

I had a thought. What if we remove iteration from the hashmap API. We could have a Red Black map that provides the full API, and comments about memory usage and performance being better for hashmaps, at the cost of the API

Derek Gustafson (Mar 08 2022 at 22:29):

I've opened a WIP PR for HashMaps written in Roc. At the moment it's basically just the API. I would appreciate people taking a look.
Are there names that could be better?
Is the order of arguments consistent with Roc's philosophy on the piping operator?

Jared Cone (Mar 08 2022 at 22:56):

All the names make sense to me at least. I hesitated with len for hashmap. I can see it being used with List or Array because they are contiguous and linear, but stuck out to me in the context of a hashmap. Maybe I'm just used to Num or Count with them. But since List already establishes len I think it makes sense to stick with that.

Folkert de Vries (Mar 08 2022 at 22:56):

is not exposing things like toList really a solution? doesn't that take away much of the value of these data structures?

Folkert de Vries (Mar 08 2022 at 22:57):

Jared Cone (Mar 08 2022 at 22:58):

Derek Gustafson (Mar 08 2022 at 23:04):

Derek Gustafson (Mar 08 2022 at 23:06):

@Folkert de Vries Having watched the Python2/3 divide, I'm worried that creating a barrier to update (if the hash function is updated) is asking for a divide in Roc's userbase.

Derek Gustafson (Mar 08 2022 at 23:09):

@Folkert de Vries There might be an argument for a second Map package, OrderedMap. If you need the asymptotic performance use HashMap; If you want to be able to convert the Map to a List, use the OrderedMap.

Folkert de Vries (Mar 08 2022 at 23:10):

Derek Gustafson (Mar 08 2022 at 23:12):

Folkert de Vries (Mar 08 2022 at 23:12):

Derek Gustafson (Mar 08 2022 at 23:14):

Folkert de Vries (Mar 08 2022 at 23:15):

Folkert de Vries (Mar 08 2022 at 23:16):

in any case, for a basic data structure, I want to be able to get my hands on the data again. That's the whole point

Folkert de Vries (Mar 08 2022 at 23:17):

btw rust fixes this by parameterizing over the hasher, which is a possibility here

Derek Gustafson (Mar 08 2022 at 23:19):

Derek Gustafson (Mar 08 2022 at 23:21):

Folkert de Vries (Mar 08 2022 at 23:21):

Folkert de Vries (Mar 08 2022 at 23:22):

Richard Feldman (Mar 08 2022 at 23:25):

another idea would be to name the data structure after its hashing function, e.g. call it FnvMap if it uses FNV-1a

Folkert de Vries (Mar 08 2022 at 23:25):

# LowLevel.HashMap.roc

HashMap hasher key value : ...

empty : hasher -> HashMap hasker key value
empty = \hasher -> ...

# HashMap.roc

HashMap key value : LowLevel.HashMap.HashMap DefaultHasher key value

empty : HashMap key value
empty = LowLevel.HashMap.empty defaultHasher

Richard Feldman (Mar 08 2022 at 23:25):

Folkert de Vries (Mar 08 2022 at 23:26):

the answer to "I want to map keys to values with deduplication" should be simple

Folkert de Vries (Mar 08 2022 at 23:27):

"let's learn about hashing" isn't all that much better than "let's learn about monads"

Richard Feldman (Mar 08 2022 at 23:27):

of course, a plausble answer to that is AssocList which doesn't even use hashing :wink:

Folkert de Vries (Mar 08 2022 at 23:29):

AssocList is weird because, yes it is the simple solution, but then it has horrible big-O complexity (which we all learn is bad), but then in practice the performance can actually be better than the structures with better big-O

Folkert de Vries (Mar 08 2022 at 23:29):

Derek Gustafson (Mar 08 2022 at 23:35):

I feel like this brings us back to what I was thinking. No toList for HashMap, but we could have an assocList with a map API for if you want to convert to a list.

Folkert de Vries (Mar 08 2022 at 23:37):

again I give you my example of you use some random package that gives you one of these closed data types, and you're just stuck.

Folkert de Vries (Mar 08 2022 at 23:38):

I mean, it's fine for an experiment for now, but that's not what I'd want the "chosen one" semi-official key-value map implementation to look like

Richard Feldman (Mar 08 2022 at 23:46):

Richard Feldman (Mar 08 2022 at 23:47):

related: could something with an Ordering constraint be faster than hash maps for small collections and still reasonable (but slower than hash maps) for large ones?

Richard Feldman (Mar 08 2022 at 23:48):

Folkert de Vries (Mar 08 2022 at 23:48):

Folkert de Vries (Mar 08 2022 at 23:49):

Folkert de Vries (Mar 08 2022 at 23:50):

Derek Gustafson (Mar 08 2022 at 23:53):

Folkert de Vries (Mar 08 2022 at 23:55):

yes but you're chasing pointers on the heap. We've seen linear scans of an array of keys win for even very large sets/maps

Folkert de Vries (Mar 08 2022 at 23:55):

Folkert de Vries (Mar 08 2022 at 23:56):

if your key is some more complex structure linear scan isn't as good, but still will win up to surprisingly large collections

Derek Gustafson (Mar 08 2022 at 23:56):

In order to write the HashMap in Roc, it was going to be a Hash Array mapped tree. Which would also require chasing pointers

Folkert de Vries (Mar 08 2022 at 23:56):

Derek Gustafson (Mar 08 2022 at 23:56):

Derek Gustafson (Mar 08 2022 at 23:57):

I was looking at Haskell's implementation. They claimed base 16, but I think it was actually 32

Folkert de Vries (Mar 08 2022 at 23:57):

and the size of the array is usually kinda large I believe, like 32 or 64 elements?

Derek Gustafson (Mar 08 2022 at 23:58):

Folkert de Vries (Mar 08 2022 at 23:58):

Richard Feldman (Mar 09 2022 at 00:25):

Richard Feldman (Mar 09 2022 at 00:27):

Richard Feldman (Mar 09 2022 at 00:36):

another thing to consider: something I've seen requested in other languages (but have never personally used) is a request to have iteration follow the order in which the entries were originally inserted

Richard Feldman (Mar 09 2022 at 00:37):

I've heard when Python switched to guaranteeing this, there was no performance impact

Richard Feldman (Mar 09 2022 at 00:38):

presumably the way they would have done that is to maintain a separate array of keys in insertion order

Richard Feldman (Mar 09 2022 at 00:38):

I guess if they wanted to make removals O(1) they'd also need a separate hash map from keys to indices in the insertion order array :sweat_smile:

Richard Feldman (Mar 09 2022 at 00:39):

but I suppose that's worth considering - if you have "hash map except it preserves insertion order," then it's strictly slower than an unordered hash map, but you can change the algorithm at will

Richard Feldman (Mar 09 2022 at 00:40):

Richard Feldman (Mar 09 2022 at 00:41):

plus you could still do the "associative array list at small sizes" thing, which would also maintain insertion order of course - and would presumably be faster than a small hash map

Brendan Hansknecht (Mar 09 2022 at 00:48):

Just a note, I am going to try and implement something quite similar to the absl::flat_hash_map directly in roc. I think it should be doable.

Brendan Hansknecht (Mar 09 2022 at 00:53):

Personally, I don't think removing toList is the solution here. Packages are not part of the standard library and will have versioning. If a user depends on an implementation detail, they may get hurt when the package update. That is completely fine. The user can choose not to upgrade. Personally, I am debating using the version of the library as a seed for all hashing so that every update the order is randomize no matter that hashing function chosen.

Brendan Hansknecht (Mar 09 2022 at 00:54):

Brendan Hansknecht (Mar 09 2022 at 00:57):

Python dict is really slow to begin with, so having no performance impact may be relative?

Derek Gustafson (Mar 09 2022 at 00:59):

Probably because I'm still trying to wrap my head around the mutable functional programming that Roc has created

Brendan Hansknecht (Mar 09 2022 at 01:00):

@Folkert de Vries Do you think hat taking the hasher would be optimized well in Roc. I am worried that unlike rust or C++ where it is compile time optimized away that in roc it would leave in a lot of switches and function calls. I assume the hasher would just be a struct of closures that know how to hash every primitive.

Richard Feldman (Mar 09 2022 at 01:01):

@Derek Gustafson the way we've been designing everything (which has worked out so far!) is basically to assume everything is going to be mutated in place at runtime

Richard Feldman (Mar 09 2022 at 01:01):

Richard Feldman (Mar 09 2022 at 01:04):

like for example, assume that what List.set will do is mutate the given list in place and then return the mutated one

Richard Feldman (Mar 09 2022 at 01:05):

and then be careful not to defeat that optimization unnecessarily, by (for example) storing the same list in two places, causing it to have a reference count greater than 1

Richard Feldman (Mar 09 2022 at 01:07):

but even if that does happen, it may just get cloned once (by List.set for example, which will see that it has a refcount > 1 and therefore ineligible for in-place mutation) and then after the cloning, you end up with two distinct lists, one of which has been mutated, and both of which now have a refcount of 1

Richard Feldman (Mar 09 2022 at 01:07):

Derek Gustafson (Mar 09 2022 at 01:09):

@Richard Feldman Yeah, I know. That's what makes me so excited for Roc. But the intuition isn't there yet

Brendan Hansknecht (Mar 09 2022 at 01:11):

Also, do we have anyway to make a default or dummy element with making a tag union and wasting space? For List backed dictionaries, they will have plenty of empty slots that require some sort of value.

Brendan Hansknecht (Mar 09 2022 at 01:11):

I don't necessarily want a user to have to pass in a default element that we store.

Brendan Hansknecht (Mar 09 2022 at 01:14):

My current idea to avoid this is to just duplicate existing items in order to avoid need a default.

Derek Gustafson (Mar 09 2022 at 01:36):

@Richard Feldman Just took a look at Python's Dict implementation. They have a hash table of indices into an array of key/value pairs.

Richard Feldman (Mar 09 2022 at 01:52):

interesting! To Brendan's point earlier, I wonder how they store deleted entries :thinking:

Richard Feldman (Mar 09 2022 at 01:54):

clearly they aren't just iterating directly over the k/v pair array, since deletions may have removed entries :big_smile:

Richard Feldman (Mar 09 2022 at 01:56):

actually come to think of it, we do already have an optimization that stores tag union discriminants in pointer bits if possible

Richard Feldman (Mar 09 2022 at 01:56):

so if you had a key or value that had a pointer in it, I think just wrapping it in [ Entry key value, Removed ] would actually not take up any more space

Richard Feldman (Mar 09 2022 at 01:57):

but it would need extra space if both key and value were integers, for example

Richard Feldman (Mar 09 2022 at 01:57):

double pointer dereference for each lookup is not ideal performance-wise :sweat_smile:

Derek Gustafson (Mar 09 2022 at 02:00):

Considering what goes into executing a single bytecode in Python, I don't think double dereferncing is really a concern

Richard Feldman (Mar 09 2022 at 02:03):

Richard Feldman (Mar 09 2022 at 02:04):

Derek Gustafson (Mar 09 2022 at 02:06):

Just dug deeper. The use Null pointers for deleted entries, and skip over them when iterating.

Derek Gustafson (Mar 09 2022 at 02:09):

Richard Feldman (Mar 09 2022 at 02:17):

Richard Feldman (Mar 09 2022 at 02:48):

another important consideration about AssocList and friends: a common use for dictionaries is to have string keys

Richard Feldman (Mar 09 2022 at 02:49):

if they're small strings, that might not be so bad, but if they're large strings (e.g. URLs and email addresses are often over 24B) then there'll be lots of pointer dereferencing just to complete the equality checks :grimacing:

Richard Feldman (Mar 09 2022 at 02:49):

Derek Gustafson (Mar 09 2022 at 02:50):

Richard Feldman (Mar 09 2022 at 02:53):

Richard Feldman (Mar 09 2022 at 02:54):

although given the complexity that would introduce to a lot of other use cases, I'd default to thinking that string keys are a point in favor of an ordinary hashmap :big_smile:

Derek Gustafson (Mar 09 2022 at 02:58):

So, back to what started this whole discussion today: Should there be a toList function? If so, what's it's order?

Brendan Hansknecht (Mar 09 2022 at 03:08):

I would vote for yes and random order, but in general, it is something that can be worried about later. Because much more important of the library can not be built in roc and is written is zig or directly in ir.

Brendan Hansknecht (Mar 09 2022 at 03:09):

Separate list that has a bit to spare and helps make lookups faster by reducing pointer dereferencing.

Richard Feldman (Mar 09 2022 at 03:09):

Derek Gustafson (Mar 09 2022 at 03:10):

Brendan Hansknecht (Mar 09 2022 at 03:11):

Essentially whenever I talk about specific implementation I tend to base off of absl::flat_hash_map, but when I talk about design, I am talking about the theoretical roc version.

Richard Feldman (Mar 09 2022 at 03:12):

Richard Feldman (Mar 09 2022 at 03:13):

and because of that last point, the advice would be "try not to rely on the ordering, because if you do, it'll make upgrading painful in the future"

Brendan Hansknecht (Mar 09 2022 at 03:13):

Also, my biggest issue with the python dict is that there hash functions are terrible which naturally leads to more collisions which slows things down even more.

Richard Feldman (Mar 09 2022 at 03:22):

Richard Feldman (Mar 09 2022 at 03:23):

e.g. I noticed wyhash calls count_ones, which we don't currently have a builtin for

Jared Cone (Mar 09 2022 at 03:25):

Would any runtime operations on the dict change the ordering? Like, dict with (a,b) toList produces [a,b]. But then you add c and now toList produces [b,a,c]

Brendan Hansknecht (Mar 09 2022 at 03:25):

Jared Cone (Mar 09 2022 at 03:33):

Yea I would think that would be the more prevalent issue that people would run into as opposed to depending on ordering being stable between versions

Jared Cone (Mar 09 2022 at 03:35):

Brendan Hansknecht (Mar 09 2022 at 03:38):

This makes me think of how for debug builds and tests, google now randomize the order of elements that have the same value when sorting. It stops people from depending on it and enables them to upgrade the algorithm. Of course, you can specifically call a stable sort method if you need that behavior, but if you don't specify needing the behavior, they don't want you to depend on implementation details of the sort.

Brendan Hansknecht (Mar 09 2022 at 03:39):

I really wish this was possible to do in roc, but the only way to do it would be to take a random seed...which easily could be set to a non-random value.

Robin Heggelund Hansen (Mar 09 2022 at 08:04):

RRBT, HAMTs and AMT

RRBT is not a hash-map. It's an Array implementation. It's benefits over AMT is that it retains amortized constant time operations for operations that manipulate the left/front of the Array as well as concatination. An AMT implementation (like the one in Elm) effectively needs rebuilding the tree for such operations. It does come at a slight bit of overhead though, but it shouldn't be to terrible.

While RRBT, HAMT and AMT are today known as immutable collections, they weren't designed that way from the offset. Phillip Bagwell (inventor of HAMT and co-inventor of RRBT) initially designed the HAMT as a mutate-in-place datastructure which used a tree structure for easier scalability (didn't require rebalancing, and so didn't have the same worst-time memory requirements).

Besides, any batch operations on such data structures (like map, or batch insert) would take advantage of mutability like what you've got in Roc.

The extra book-keeping to retain insertion order has some overhead, but it's still faster than Elm's Left Leaning RRBT.

Python's insert-order dict

Python retains insertion order at no performance expense _compared to Python's earlier implementation_, which is an important point to understand.

The way deletions are implemented is that you insert a tombstone for a particular key. A tombstone is just an object that can easily be recognized as a tombstone, and whenever a tombstone is encountered, the implementation acts if it found no value. Easy to do without performance overhead in a dynamic language, harder in ML. Though I guess Roc's tags might help with that?

Anyway, at some point, the hashmap has to be rebuilt (either to accommodate more keys, or simply to prevent an overflow of tombstones), at which point the tombstones are filtered out. A regular HAMT, in comparison, doesn't require rebuilding at set intervals.

Hashes and orderings

I belive the concern against relying on hashed-ordering is that it could change from one release of the std library to the next. I'd just like to say that this happens _very_ rarely in practice. Most languages don't change the hash that often, and new hashes don't show up that often either. It's an option to simply commit to a given hash-implementation, which many languages (like Java) do. Although, most languages also allow you to change the hash implementation being used :shrug:‍♂️

Efficient ordering-based map

The consensus right now is that you can't really do much better than an RRBT. _However_ in a language like Roc where you're not storing everything as a pointer, it might be worth considering just going for a 2-3 or 2-3-4 tree. It's more code, but less complexity, and you should be able to store 2-3 non-heap key-value pairs per node if you've got monomorphization. Doubt it'll be faster than a HAMT, though.

Kevin Gillette (Mar 09 2022 at 14:01):

@Richard Feldman I wonder if the naming of FnvMap and WyMap run counter to the goal of a high level language with tight performance: in my mind, that means I write very high level code and it happens to perform quite well (not necessarily "fastest ever," but fast enough that performance would not be a factor in choosing against Roc except in very niche cases). Even as a seasoned engineer, seeing FnvMap and WyMap as the choices means I need to read wikipedia and blog posts specifically comparing the two before I can make a decision, but I'd much rather just have some expert pick the generally right choice, give it a general name, and provide some loose guidance on "if your workload looks like X then consider choosing Y instead."

Beginner-friendliness might be something a clean language design provides "for free" without necessarily being an explicit goal, but echoing @Brendan Hansknecht, I believe that lacking a clear, good enough default for application authors to choose would make the language less usable and less approachable.

fwiw, having some kind of key-val mapping concept (whether or not it's a hashmap) is a fundamental need in programming, and a language feels hollow (to me) if it doesn't have a great implementation in the stdlib if not the language itself. Yeah, we should encourage people to use records when records fit the need, which is most of the time, but there's already precedence for a "here be dragons" syntax with sets in Roc, which use a list-like but not list-identical literal syntax. I'm certainly open to the argument that "any use of map literals should be using records instead," but there are regularly cases where one needs to initialize a mapping with a mix of static and dynamic values: requiring that to be a sequence of Map.add calls can obfuscate behind an algorithm what could have been data-declarative code.

Kevin Gillette (Mar 09 2022 at 14:29):

Does referential transparency mean "across time and space," or is that a related but distinct concept? I've taken it to just mean within the confines of a single process in practice, and that behavioral changes within different processes of the same binary, or across different binaries, while not ideal, do not technically break the property.

We could have platforms, such as an HTTP server platform, pass an opaque request context type to handler functions, such that the opaque type could be exchanged for a seed (thus not relying on the app author to generate their own seed, though still relying on them to not pass 0 as the seed). We could also have the platform provide a higher order data-initializer function(s), like:

init : PlatformOpaque -> (Seed -> a) -> a

Called like pf.init req Hash.new where req is the opaque type given by the platform which internally contains a seed.

In terms of flakey tests vs random seeds, we could have some test running mode that knows about seed-taking (or implicitly seed) types, and either runs it with a fixed seed for debugging, or runs the test (or relevant portions of the test, because referential transparency!) multiple times, or using test-compiled instrumentation (as context-aware fuzzers do) to ensure that those flakes are exposed on every test invocation rather than only occasionally. Combined with test result caching (don't rerun the test if nothing about the code has changed), this seems pretty solvable.

In Go's case, it appears to work pretty well, since the community doesn't regularly complain about flakey tests due to map iteration order, suggesting that issues are reliably uncovered very quickly (I've been in the Go community for over 10 years, so that's not a from-the-outside-in observation). I suspect giving Roc mappings an implicit seed would not lead to flakey tests in practice, especially with some degree of test support. That would also save us from needing to choose data structures or algorithms that guarantee insertion-order iteration or some other meaningful order.

Brendan Hansknecht (Mar 09 2022 at 16:46):

I think that naming the map is much worse than that. To maximize performance you likely need multiple types of maps as well. So it would be WyFlatMap and FvnFlatMap and WyNodeMap and ....

I 100% agree. Even though I am advocating for multiple map implementations, I think that basically all users should always be using something that is similar to absl::flat_hash_map. I just think that if it is written in Roc, we should definitely document alternatives/cases where you would want to pick a different map or hash function. I think that 95+% of users would just use the default map and be happy, but when someone hits and edge case, it tends to really matter.

Yeah, initialization is definitely nicer with full builtins. Though I guess it isn't much worse than Dict.fromList [ T k1 v1, T k2 v2, T k3 v3 ] or Set.fromList [ k1, k2, k3 ]

    dict = Map.empty {}
        |> Map.add "test" 124
        |> Map.add "foo" 7
        |> Map.add "bar" 12

This is what I have seen most hash maps do. It requires special fingerprinting algorithms if you need to read and write to disk.

This is a hard one. I think it is technically true, but it is hard for a lot of people when expecting certain things from a pure functional language. They generally expect if they compile the same app twice with the same version of compiler and libraries, they will get the same result. That is no longer true.

I would highly advise against a fix seed. People will just be confuse why their test passes but their actual code fails.

I used to use Go a lot, so I totally understand what you mean. I think that if Roc always gave random seed to maps and it was well known, that would fix the issues. Of course, it will probably also anger a lot of functional programmers because it feels very impure. I think it is just important that a implicit seed would need to be random and should not be some sort of static constant. Also, the reason that a random seed doesn't lead to flaky tests is that it is extremely unlikely to produce the same hardcode result twice. So the test will fail way more than it succeeds and stick out like a sore thumb. As an extra bonus, random seeds help to mitigate hash flooding.

Jumping back to this point now because I think it might have more complexities to talk about. A few first thoughts:

As an extra note, we could theoretically have a Roc constant like Num.appSeed : Seed. Where Seed := Nat and the seed only has a get method. Then anything that just needs a random initializer could take in Seed and use it. The appSeed is a constant that is random but generated at compile time (or maybe app startup time). For anything that needs more randomness, it then would still depend on getting it from the platform.

Kevin Gillette (Mar 09 2022 at 19:29):

I definitely agree that there is room and use for multiple map types, but we should be able to engineer a pretty good default choice, as you say, for the 95% case.

I was actually thinking Seed might just be another name for U64 (or U64 itself), and transparent, but that the platform would provide such a convenient way to get a seed that you'd just use the platform.

Brendan Hansknecht (Mar 09 2022 at 22:48):

I see. I guess that also works. It just means that anyone can pass a constant seed to the mapand then expect it to always have the same order, but if we update the hash map in some way, that may not be true. But if it was codified in most platforms, people should definitely see it coming when it randomly breaks if they use a constant seed over the platform generated one.

Richard Feldman (Mar 10 2022 at 03:43):

Richard Feldman (Mar 10 2022 at 03:44):

Brendan Hansknecht (Mar 10 2022 at 03:48):

Brendan Hansknecht (Mar 10 2022 at 03:49):

Otherwise, it looks to use hashbrown internally, so I think that it just adds an extra layer of indirection to make the ordering work, but this is on a super quick skim and could be off.

Brendan Hansknecht (Mar 10 2022 at 03:50):

Robin Heggelund Hansen (Mar 10 2022 at 12:50):

yikes. Under the "efficient ordering-based dict" i mention RRBT. I meant Red Black Tree :sweat_smile:

Robin Heggelund Hansen (Mar 10 2022 at 12:51):

Richard Feldman (Mar 11 2022 at 19:25):

Brendan Hansknecht (Mar 11 2022 at 19:36):

Random thought. With abilities, we could make Dict.* work on any dictionary, including those built in roc, right?

That could greatly streamline use if we can make other Dicts a drop in replacement with the exception of a few functions. Though I guess you could also just use an alias so maybe that doesn't matter too much.

Martin Stewart (Mar 11 2022 at 20:05):

If the Dict preserves insertion order then dict1 == dict2 will be false if dict1 and dict2 to have the same items but in a different order? If yes, that seems like a bit of a gotcha. That said, I've used AssocList a lot in Elm which has the same issue and in practice it hasn't been too much of an issue.

Brendan Hansknecht (Mar 11 2022 at 20:08):

Richard Feldman (Mar 11 2022 at 20:13):

dict1 == dict2 should determine its answer independently of insertion order, meaning it would likely be pretty slow if both dictionaries had the same length :stuck_out_tongue:

(that siad, dict1 == dict2 is not something I expet people to be be doing often outside tests!)

Richard Feldman (Mar 11 2022 at 20:13):

but that's true of hashmaps in general because hash collision resolution depends on insertion order

Richard Feldman (Mar 11 2022 at 20:14):

Martin Stewart (Mar 11 2022 at 20:18):

I think it's typical to expect that if dict1 == dict2 is true, then someFunction dict1 == someFunction dict2 is also true? If Dict.toList sorts based on hash then this invariant would indeed hold.

Richard Feldman (Mar 11 2022 at 20:19):

Richard Feldman (Mar 11 2022 at 20:20):

Richard Feldman (Mar 11 2022 at 20:21):

e.g. I want to say user1 == user2 in my test, and because User has a dictionary in it somewhere, and they were both built up differently, now I have to make my test dramatically different :grimacing:

Richard Feldman (Mar 11 2022 at 20:22):

:thinking: is (val1 == val2) == (someFunction val1 == someFunction val2) otherwise true in Roc for all other types besides dictionaries?

Richard Feldman (Mar 11 2022 at 20:22):

Martin Stewart (Mar 11 2022 at 20:23):

I think it holds for everything else. It's almost synonymous with referential equality?

Richard Feldman (Mar 11 2022 at 20:25):

heh, I guess an upside of making dict1 == dict2 compare based on insertion order would be that it would run faster :big_smile:

Richard Feldman (Mar 11 2022 at 20:26):

could always add a Dict.sameContents : Dict k v, Dict k v -> Bool which ignores insertion order, although it probably wouldn't be as convenient in tests

Richard Feldman (Mar 11 2022 at 20:29):

separate question: what if the names of the two remove functions were remove and removeOrdered, thus nudging you to use the faster one by default?

Richard Feldman (Mar 11 2022 at 20:30):

I think in the uncommon case where people really cared about order, someone could make a third-party package called OrderedDict which was an opaque wrapper around Dict except its remove function called removeOrdered under the hood

Richard Feldman (Mar 11 2022 at 20:30):

so that you'd have the guarantee that it never disturbed the order even accidentally, because the original remove was no longer in the API

Richard Feldman (Mar 11 2022 at 20:30):

and there'd be no runtime overhead since it would be an opaque wrapper around the original

Brendan Hansknecht (Mar 11 2022 at 20:34):

removeOrdered doesn't sound like it will necessarily take a long time, but I definitely think removeSwap should just be remove

Richard Feldman (Mar 11 2022 at 20:40):

Richard Feldman (Mar 11 2022 at 20:46):

worth noting, we can't guarantee this across the language. People can absolutely make custom equality implementations for their custom collections which don't obey this.

Richard Feldman (Mar 11 2022 at 20:46):

Brendan Hansknecht (Mar 11 2022 at 20:53):

Richard Feldman (Mar 11 2022 at 20:54):

Richard Feldman (Mar 11 2022 at 20:55):

Derek Gustafson (Mar 11 2022 at 20:55):

Richard Feldman (Mar 11 2022 at 20:55):

Martin Stewart (Mar 11 2022 at 20:59):

Richard Feldman (Mar 11 2022 at 21:02):

and seems like something that probably ought to be encouraged, even if it can't be a guarantee

Richard Feldman (Mar 11 2022 at 21:02):

another thing that occurs to me is that if == checks for contents only, it could lead to bugs in tests

Richard Feldman (Mar 11 2022 at 21:02):

Richard Feldman (Mar 11 2022 at 21:03):

because maybe you have code that relies on the ordering, and the test says "oh yeah these are ==, everything works fine" and then the ordering changes and the test misses it, and something breaks elsewhere

Derek Gustafson (Mar 11 2022 at 22:52):

But, I question the assumption that it will be significantly faster to be written in Rust

Derek Gustafson (Mar 11 2022 at 22:54):

With what @Brendan Hansknecht is already able to do, performance wise, for a hash function. And the light it shines on where we can improve the optimizations, I think there's a strong chance that a Roc based hashmap could be competitive.

Richard Feldman (Mar 11 2022 at 22:58):

Derek Gustafson (Mar 11 2022 at 22:58):

Richard Feldman (Mar 11 2022 at 22:58):

Derek Gustafson (Mar 11 2022 at 23:04):

From a practical point of view, you're right.
From a philisophical point of view, I feel like being able to say "Our standard library is written in out language" is a selling point

Derek Gustafson (Mar 11 2022 at 23:05):

I like to be able to look at a language's standard library to get an idea of what code should look like

Richard Feldman (Mar 11 2022 at 23:05):

I hear that, although I made a decision awhile ago that when the choice is between the stdlib going faster and being in Roc, the priority is for it to go faster :big_smile:

Derek Gustafson (Mar 11 2022 at 23:06):

Kevin Gillette (Mar 12 2022 at 04:50):

This could be pretty surprising. I would indeed expect that if dict1 == dict2, I would expect to be able to get the same iteration order out of either (thus they really _should_ be equal only if they have the same iteration order), yet it's equally surprising for the iteration order of a dictionary to have any bearing on equality.

I suppose this is one of the reasons that so many languages define dictionaries to have undefined or otherwise unordered iteration (so as to not commit to it being a property influencing equality), yet I also see how this is particularly important for FP.

Kevin Gillette (Mar 12 2022 at 04:51):

Anything for which equality is defined (which the language sufficiently understands) _can_ be used as mapping key (if every value has a single canonical representation, I suspect it is also hashable). This means that dictionaries and sets can themselves be keys (or set elements). This could be important to generic memoization, where the input may be a dictionary or set.

Kevin Gillette (Mar 12 2022 at 05:32):

iiuc, when two dictionaries share the same hash function (and perhaps same number of buckets/nodes), you can do a bucket-wise comparison: if any bucket has a different length than its pair in the other dictionary, then the two dictionaries are not-equal. If any kind deterministic ordering can be maintained between elements in a bucket, then elements within paired buckets can be pairwise-compared efficiently, and even in parallel across buckets. Maintaining intra-bucket ordering would be a heavier cost during insert, but could unlock efficient whole-of-set operations such as intersection, difference, etc.

Kevin Gillette (Mar 12 2022 at 05:37):

That seems apt. There are certainly use-cases for specifically ordered dictionaries, but ordering for a hypothetical Roc dictionary just seems to be an incidental outcome of seeking determinism.

Brendan Hansknecht (Mar 12 2022 at 06:00):

If two dictionaries share the same implementation, hash function, and capacity, you should be able to do bucket wise comparison. That is a lot of assumption though.

Also, in some maps it may not be possible to get an individual bucket. absl::flat_hash_map, for example, can not tell one bucket from the next without extra calculations in many cases. So you will get a group of buckets with interleaved elements as opposed to a single bucket.

Brendan Hansknecht (Mar 12 2022 at 06:03):

Could theoretically get the entire map in one chunk as if it was a single bucket, but that is mathematically unlikely.

Brendan Hansknecht (Mar 12 2022 at 06:03):

Most modern fast hash maps are quite similar internally and would see similar issues.

Brendan Hansknecht (Mar 12 2022 at 06:09):

I think that if equality doesn't depend on insertion order, we should not make the iteration depend on insertion order. Whatever we pick here, I feel it would be best to be consistent.

Brendan Hansknecht (Mar 12 2022 at 06:23):

As an extra note, even haskell allows for dictionaries with undefined iteration order.

Sure someone could use one to generate a seed for a random number generator, but I don't think the goal should be to develop for the pessimistic case where the user is trying to game the language.

Richard Feldman (Mar 12 2022 at 17:48):

I wonder why IndexMap stores the actual hash value alongside each entry :thinking:

Richard Feldman (Mar 12 2022 at 17:49):

because you'd already have all the key-value pairs in a contiguous chunk of heap memory

Brendan Hansknecht (Mar 12 2022 at 17:59):

It looks like they use it to update the indices hash table without having to recompute the hash.

Brendan Hansknecht (Mar 12 2022 at 18:00):

Richard Feldman (Mar 12 2022 at 18:14):

Richard Feldman (Mar 12 2022 at 18:15):

Brendan Hansknecht (Mar 12 2022 at 18:36):

Richard Feldman (Mar 12 2022 at 18:36):

Brendan Hansknecht (Mar 12 2022 at 18:37):

Which is super fast in the case of small keys and probably not a big deal, but really bad for large keys.

Richard Feldman (Mar 12 2022 at 21:22):

here's an idea for a modification of the IndexMap design: instead of having remove be "remove and swap," instead have it be "remove by leaving a tombstone"

Richard Feldman (Mar 12 2022 at 21:26):

Richard Feldman (Mar 12 2022 at 21:27):

having remove preserve order would definitely make it less surprising to use, which I like, and having it be a bit faster seems nice too

Richard Feldman (Mar 12 2022 at 21:27):

assuming that seems like a good idea, it would raise the question of how to represent a tombstone

Richard Feldman (Mar 12 2022 at 21:28):

one idea is to use the HashValue that's already stored next to the key and the value

Richard Feldman (Mar 12 2022 at 21:29):

have a HashValue of 0 represent a tombstone, which means we have to disallow an actual hash value of 0. That can be done cheaply by using (hash + (hash == 0)) instead of hash whenever we compute a hash value

Richard Feldman (Mar 12 2022 at 21:29):

so that would mean an extremely slight increase in chance of collisions, because anything that hashed to 0 would instead hash to 1

Richard Feldman (Mar 12 2022 at 21:30):

but it would only cost an addition and a compare to zero, and would only come up when actually computing a hash - so basically insert, get, and remove only

Richard Feldman (Mar 12 2022 at 21:31):

really the thing that I like most about it is that it seems like having remove not preserve order when literally every other operation does seems like a thing that would be a source of confusion

Richard Feldman (Mar 12 2022 at 21:31):

maybe bugs, maybe not, but certainly a lot of "wait why is it that way?" questions

Richard Feldman (Mar 12 2022 at 21:32):

seems like it would just feel a lot nicer to use if it Just Worked while preserving both order and asymptotics

Richard Feldman (Mar 12 2022 at 21:32):

plus then there'd be no real need for a separate OrderedDict wrapper data structure, since just using it normally would get you a completely ordered one

Richard Feldman (Mar 12 2022 at 21:33):

also wouldn't make it as weird to explain why equality preserved order if the design of remove (assuming it was essentially removeSwap) encourages that you treat it like an unordered map even though it's ordered

Brendan Hansknecht (Mar 12 2022 at 21:37):

There are hash functions that already have guarantee like this so you don't need to explicitly deal with it. I think wyhash might have a tombstone or 2. If not slightly modified versions of it should.

Richard Feldman (Mar 12 2022 at 21:38):

Brendan Hansknecht (Mar 12 2022 at 21:40):

So with this change, wouldn't every insert be O(n) to scan the list for tombstone? If not, tombstones will just collect and we will waste tons of memory, right?

Richard Feldman (Mar 12 2022 at 21:41):

my thinking is that they'd collect, but as soon as we run out of capacity and need to grow the array, we'd compact them when copying them into the new allocation

Richard Feldman (Mar 12 2022 at 21:42):

so that might happen more often than necessary, but that only happens if you're doing a bunch of removals followed by a bunch of insertions. If you know that's your use case, you can explicitly opt into removeSwap if you don't care about order, or if you do care about order, this is still better than removeShift because the shifting only happens occasionally instead of every single time

Richard Feldman (Mar 12 2022 at 21:44):

we could also have a heuristic like "if we only used X% of our capacity before running out, just compact in-place and don't make a new allocation"

Richard Feldman (Mar 12 2022 at 21:45):

so in that case it would be sort of like "lazily removeShift when we run out of space only"

Richard Feldman (Mar 12 2022 at 21:46):

although that might be a rare enough situation that maybe it's not worth storing the extra metadata to make it happen :thinking:

Richard Feldman (Mar 12 2022 at 21:46):

Richard Feldman (Mar 12 2022 at 21:52):

basically it seems like it would fit the "good default choice for a dictionary" criteria

Richard Feldman (Mar 12 2022 at 21:53):

if you end up with tombstones taking up too much memory, the performance optimization to fix it is easy: change from remove to either removeSwap or removeShift depending on whether you care about ordering

Brendan Hansknecht (Mar 12 2022 at 21:59):

The index map array is not tied to any capacity. Though I guess you could tie it to the capacity of the index hashmap.

Brendan Hansknecht (Mar 12 2022 at 22:00):

Richard Feldman (Mar 12 2022 at 22:01):

Richard Feldman (Mar 12 2022 at 22:02):

Brendan Hansknecht (Mar 12 2022 at 22:02):

Richard Feldman (Mar 12 2022 at 22:02):

although actually in practice it will most likely have to traverse less than an unordered hashmap

Richard Feldman (Mar 12 2022 at 22:02):

maybe a better way to say that: worst-case is same as unordered hashmap, but typical case is likely to be better :big_smile:

Richard Feldman (Mar 12 2022 at 22:04):

separately, I realized since we monomorphize, we could actually completely swap out the implementation for certain cases where we know it can be a lot faster, e.g. when the keys are integers

Brendan Hansknecht (Mar 12 2022 at 22:06):

One other thing I realized that might be the worst case for this change. The size of the data array is not at all tied to the capacity. If I delete and re-add the same element 1,000,000 times. I would get 1,000,000 tombstones but the capacity would never change of the index map.

Richard Feldman (Mar 12 2022 at 22:07):

oh, so I'm assuming that we have the same capacity value for both the indices and the data array

Richard Feldman (Mar 12 2022 at 22:08):

in fact we'd probably want to have one heap allocation for both, just partition it into (refcount, data array, indices)

Richard Feldman (Mar 12 2022 at 22:10):

but I think the real key here is that if you're doing a million removals and a million re-adds of the same element, it's probably time to consider using either removeSwap or removeShift instead of remove :laughing:

Brendan Hansknecht (Mar 12 2022 at 22:11):

hmm. If you did that, I guess that means you would get a lot of compactions. It would fill up the capacity of the data array with tombstones then compact and then repeat. So would just group compaction to whenever the capacity was full of tombstones.

Richard Feldman (Mar 12 2022 at 22:11):

Brendan Hansknecht (Mar 12 2022 at 22:11):

Also, removing and reinserting an element many times is not at all uncommon in hot loops of code.

Richard Feldman (Mar 12 2022 at 22:11):

Brendan Hansknecht (Mar 12 2022 at 22:12):

Richard Feldman (Mar 12 2022 at 22:12):

well, if you're removing and re-inserting in a hot loop, I certainly hope you don't care about insertion order :stuck_out_tongue:

Brendan Hansknecht (Mar 12 2022 at 22:12):

Sure, but we are making the dict care about insertion order by default even if the user doesn't care

Richard Feldman (Mar 12 2022 at 22:12):

Brendan Hansknecht (Mar 12 2022 at 22:13):

Yeah, but I thought the point of this change was to remove the need for users to know about removeSwap and removeShift

Richard Feldman (Mar 12 2022 at 22:14):

oh, no - not to have them not need to know about it, more to have the default path work intuitively and have good performance in common use cases

Brendan Hansknecht (Mar 12 2022 at 22:14):

Then I would just default to removeSwap. The common case is definitely not depending on the insertion order of a dictionary

Richard Feldman (Mar 12 2022 at 22:16):

I guess we could try it that way and see if people find it confusing in practice :thinking:

Richard Feldman (Mar 12 2022 at 22:16):

Kevin Gillette (Mar 13 2022 at 00:14):

Perhaps the ordering strategy can just be declared at dictionary initialization time, if one ordering or another is important beyond just guaranteeing determinism of some kind? It seems more valuable to me to have a whole-of-dictionary policy rather than choosing removeShift in this case, then removeSwap immediately after that (when and why would such flexibility be needed)?

If removeSwap only exists for performance, and, were removeShift somehow had the same performance, it doesn't seem like there'd be any reason for removeSwap (we've used phrases like "that's clever" or "it'll be faster," but not "that's actually what I semantically want anyways"). With tombstoning, which sounds promising, it sounds like we're exploring ways to achieve reasonable performance without sacrificing semantics.

An aside: there are likely cases we can easily optimize, such as when a dictionary is proven not to ever be iterated upon (only insertions, deletions, or keyed lookups are used), in which case we don't even need IndexMap, or if we use it anyways (such as for better icache performance), we can use removeSwap for all removals, because order will never be concerned. This decision can also change at function boundaries: if I define a function which accepts a dictionary, I generally don't get to have assumptions about the ordering of items in my input (because if I made such assumptions, the caller could just initialize the dictionary in a different order anyways), thus if the caller doesn't use iteration, then the caller's deletions can use removeSwap, even if my own function uses iteration after deletion, thus my own deletions use removeShift or tombstoning implicitly.

With tomestones, the "reached capacity, thus do a tombstone removal sweep" will likely lead to measurable latency spikes. We can instead use a strategy of making any update pay for a partial tombstone cleanup, such that we maintain an offset in the dictionary indicating the index of the first tombstone (which is the length of the array when there are no tombstones):

Also, two indices (start of first tombstone, and "shift horizon" after the first tombstone) would generally allow for shifting without scanning (or when tombstones collide into spans, thus the horizon offset points at a tombstone, the updating operation is only responsible for advancing the horizon by, say, 2n, presuming reads are cheaper than writes).

Further, since computers are good at memcpy operations, perhaps a index of last tombstone is better, as that allows the tail of array to tend to be contiguous, and larger spans of elements could be shifted "left" via memcpy.

Sorted dictionaries will often be convenient. We may be able to leverage tombstones to make these not too expensive, such as over-allocating array space to allow growth, and using tombstone gaps as an optimization for keeping elements ordered while allowing inputs to only require shifting a finite, or perhaps logarithmic, number of items that follow (or precede, depending), rather than all elements that follow.

Richard Feldman (Mar 13 2022 at 01:05):

here's a simple idea: use IndexMap but don't expose removeShift directly, and have Dict be described in the docs as not guaranteeing that insertion order is preserved

Richard Feldman (Mar 13 2022 at 01:05):

however, expose sufficient primitives that someone can build OrderedDict in userspace on top of it

Richard Feldman (Mar 13 2022 at 01:06):

or perhaps it's possible to build a sufficiently fast one in pure Roc, without even using the builtin Dict!

Kevin Gillette (Mar 13 2022 at 01:15):

I am somewhat confused about the big picture. The discussion thus far seems to presuppose Roc dictionary-handling code that can be optimized into in-place mutation, and though we can anticipate that such optimizations can be regularly applied, given that Roc is a language without [observable] side effects, the language allows (and the programmer can even take advantage of) cases in which dictionaries may diverge: storing the value of a dictionary, both before an after an update, in two variables; or passing the dictionary to two separate functions that each make their own updates (which need to get "dropped" if the dictionary is not passed back); or, presumably since concurrent computation would be valuable for the types of computing loads that Roc is targeting, passing a partially initialized dictionary to concurrent tasks which each make their own modifications simultaneously.

Then there's the pathological case of non-tail recursion, where each call passes a modified copy (presume some mix of inserts, updates, and deletions) to the next invocation and then uses the received result alongside some iteration of the dictionary state it already had (thus making it very difficult to optimize this into a single dictionary with in-place mutations).

It does not seem like IndexMap, or indeed most other hashmap implementations, would be suitable for such cases. iirc, Elm uses something with similar properties to a b-tree, such that each update merely is some mask over a reference to the previous state. Would we use a similar technique in cases where optimization is not possible, or would the compiler just arrange to copy the dictionary when it detects that optimizations are no longer safe, or would it use layered dictionaries with multiple tags (Normal, Tombstone, MaskedDelete, MaskedUpdate)?

If the compiler can generate multiple dictionary implementations depending on the optimizations available, would the implementations be transparent and compatible (i.e. would dictionaries containing the same semantic contents be considered equal even if they have differing internal implementations/representations)?

Richard Feldman (Mar 13 2022 at 01:28):

Richard Feldman (Mar 13 2022 at 01:29):

Roc's entire performance design is based on a hypothesis that this will be either true in practice, or will diverge rarely enough (e.g. costing one or two automatic clones here and there, and then each of those clones gets in-place optimized from then on because each clone is now unique) that this overall strategy will pay off, and be faster overall than persistent data structures would have been

Richard Feldman (Mar 13 2022 at 01:29):

Richard Feldman (Mar 13 2022 at 01:30):

nobody's ever tried this design before in a production language, as far as I know, so we're in uncharted territory and forging our own path

Richard Feldman (Mar 13 2022 at 01:33):

I don't think this would ever come up. By definition, the compiler can only possibly generate different implementations based on knowledge it has at compile time, which in practice would mean monomorphized type information. In order for two things to be compared for equality, they have to have the same type - so if the compiler could possibly generate a different implementation, then they couldn't possibly be compared for equality!

Kevin Gillette (Mar 13 2022 at 02:35):

Jared Cone (Mar 13 2022 at 04:52):

I'm in gamedev where perf is often important, but I would still prefer a Dict that preserves order by default at a slight perf cost. IMO it's easier to profile and fix perf issues than finding and fixing bugs caused by false assumptions.

Jared Cone (Mar 13 2022 at 04:55):

Regarding equality depending on ordering: I'm new to pure functional, but I would assume that if two values are equal then I could pass either into a pure function and get the same result for both. If ordering isn't considered for equality then that assumption doesn't hold.

Richard Feldman (Mar 16 2022 at 23:06):

:thinking: what would be a situation where using a dictionary that preserves insertion order is better than using an ordinary array? (Context: I'm wondering whether Dict.removeShift is actually a good idea or just a footgun because if you want that, you're better off using a List instead.)

does anyone have firsthand experience using a dictionary that preserves insertion order?

Brendan Hansknecht (Mar 16 2022 at 23:46):

If you still want O(1) lookup of a value based on a key? an ordinary array doesn't give you that.

But in general, I have never seen code that explicitly depends on insertion order of a dictionary or hashmap. Depends on having order based on the key or value, sure, but never seen insertion order mattering.

Brendan Hansknecht (Mar 16 2022 at 23:47):

I wonder if there is a way to look up all the crates that depend on indexmap in rust. They theoretically would have a reason.

Richard Feldman (Mar 16 2022 at 23:53):

yeah so far the use cases I've heard of that would apply here are "I'm transforming some configuration JSON or TOML and I want to write it back out in the same order to minimize git diffs"

Richard Feldman (Mar 16 2022 at 23:55):

Add it certainly seems reasonable to potentially want to do a transformation that removes an entry while preserving order

Richard Feldman (Mar 17 2022 at 00:43):

relatedly, another use case that came up (but which doesn't affect removeShift) is an LRU cache

Richard Feldman (Mar 17 2022 at 00:44):

we could support that use case really nicely if we just had a Dict.replaceAt that lets you replace whatever entry happens to be at the given index

Richard Feldman (Mar 17 2022 at 00:45):

so you'd give it a new key, value, and index, and then all you'd need to do to make an LRU cache would be to separately track the index and do modulo on the capacity of the dictionary so it wraps around when you get to the end

Brendan Hansknecht (Mar 17 2022 at 01:21):

Richard Feldman (Mar 17 2022 at 01:22):

we ended up settling on the IndexMap implementation that keeps it densely packed

Brendan Hansknecht (Mar 17 2022 at 01:23):

Brendan Hansknecht (Mar 17 2022 at 01:29):

I think it has a problem for LRU cache. How do you update an element to move it to the front of the line. You would need to swap it to the front of the list and shift all elements between it and the front of the list over by one.

Richard Feldman (Mar 17 2022 at 01:38):

yeah you'd need to do something like remove to have it be fast, swap it with something else

Richard Feldman (Mar 17 2022 at 01:38):

Kevin Gillette (Mar 17 2022 at 01:41):

Least Frequently Used is often preferable compared to Least Recently Used. With LFU, you're at most swapping two indices in the array, which is pretty cheap. Frequent items shift to the front of the list, rarely used shift towards the end and get pruned

Brendan Hansknecht (Mar 17 2022 at 01:42):

Doesn't LFU normally store with PQ or something of that nature so that it is sorted and you can find the least element?

Brendan Hansknecht (Mar 17 2022 at 01:43):

In that case, I don't think the insertion order dictionary would be that helpful in general, but maybe I am missing something.

Kevin Gillette (Mar 17 2022 at 01:43):

Yes, but if the Dict value contains the frequency info, and we can have arbitrary control over swapping items in the order array, you've got your priority queue

Brendan Hansknecht (Mar 17 2022 at 01:44):

hmm. I guess with replace at, whenever you increment the counter for an element, you could check the indices next to it and potentially swap it. Though feels weird to give the end user direct access over the underlying implementation detail of the Dictionary.

Kevin Gillette (Mar 17 2022 at 01:45):

yes, agreed, but since some kind of order is important to the functional notion of equality, then we have an opportunity to embrace order

Kevin Gillette (Mar 17 2022 at 01:46):

And an LFU cache would probably abstract away those details even when using, internally, a Dict in a low level way

Brendan Hansknecht (Mar 17 2022 at 01:47):

One disadvantage is that if swapping is expensive for values, you will be taking that performance hit. with a priority queue just on the keys and a regular hash map, that could be faster. Though you could always box the value if it is large...so yeah, sounds like an LFU would work well here.

Brendan Hansknecht (Mar 17 2022 at 01:48):

instead of array of (keys, count) and a hashmap from key to value. It would be an ordered dict from key to (count, value)

Kevin Gillette (Mar 17 2022 at 01:49):

Kevin Gillette (Mar 17 2022 at 01:50):

and nicer because a removal from the array or map implicitly involves a removal from the other as well

Kevin Gillette (Mar 17 2022 at 01:51):

Richard Feldman (Mar 17 2022 at 02:17):

update : Dict k v, k, ([ Val v, NoVal ] -> v) -> Dict k v
updateAt : Dict k v, Nat, ([ Entry { k, v }, NoEntry ] -> { k, v }) -> Dict k v

...then maybe we could do an LRU cache too, by storing something like LruCache k v := Dict k { next : Nat, prev : Nat, val : v } which records the indices of the next and previous values

Brendan Hansknecht (Mar 17 2022 at 02:19):

Sure, but then the insertion order of the dict doesn't really matter. you are storing the related link list inline and that could be done with any dictionary.

Brendan Hansknecht (Mar 17 2022 at 02:20):

I guess you would always want to swap the LRU item with then end of the dictionary to make it easy to find.

Brendan Hansknecht (Mar 17 2022 at 02:20):

Brendan Hansknecht (Mar 17 2022 at 02:23):

Richard Feldman (Mar 17 2022 at 02:23):

Brendan Hansknecht (Mar 17 2022 at 02:24):

Yes, but then when they get a no entry, they give you a { k, v } to insert there.

Brendan Hansknecht (Mar 17 2022 at 02:24):

Richard Feldman (Mar 17 2022 at 02:24):

Richard Feldman (Mar 17 2022 at 02:25):

Richard Feldman (Mar 17 2022 at 02:26):

Richard Feldman (Mar 17 2022 at 02:27):

update : Dict k v, k, (v -> v) -> Dict k v
updateAt : Dict k v, Nat, ({ k, v } -> { k, v }) -> Dict k v

Richard Feldman (Mar 17 2022 at 02:28):

and if the key is not found, or the index is out of bounds, we return the original dict unmodified

Richard Feldman (Mar 17 2022 at 02:35):

Richard Feldman (Mar 17 2022 at 02:36):

updateOrInsert : Dict k v, k, ([ Entry v, NoEntry ] -> v) -> Dict k v

Kevin Gillette (Mar 17 2022 at 04:10):

emptyDict
  |> update "x" \_ -> 1
  |> update "y" \_ -> 2
  |> updateAt 0 \_ -> {k: "z", v: 3}

In other words, would z completely replace x?
An explicit index swap operation would be quite convenient for those cases in which the keys and values should remain intact, but only the relative ordering should change.

Richard Feldman (Mar 17 2022 at 04:13):

...then it would still be an empty dictionary at the end, which honestly seems less surprising to me :sweat_smile:

Richard Feldman (Mar 17 2022 at 04:14):

Kevin Gillette (Mar 17 2022 at 04:16):

Sure, there will be use-cases where best-effort is ideal. Pairing that with "checked" operations of some kind, will cover the cases in which the user needs to know whether the change succeeded, though I suppose that can also be done with index/key operations which can return a Result.

Brendan Hansknecht (Mar 17 2022 at 04:24):

emptyDict
  |> insert "x" 1
  |> insert "y" 2
  |> insert "z" 3
  |> updateAt 0 (\_ -> {k: "z", v: 2})

{ "z": 2, "y": 2 }? cause giving { "z": 2, "y": 2, "z": 3 } would be broken.
The update at just feels really strange to me because it ignores the existing elements.

Brendan Hansknecht (Mar 17 2022 at 04:27):

Also, if update at creates { "z": 2, "y": 2 }, it means that it has to remove the old z. In that case, I assume it would do a swap remove which would reorder part of the list? This might have some weird gotchas if you try to use it for an LRU or LFU cache.

Kevin Gillette (Mar 17 2022 at 05:49):

Indeed, so any updateAt could actually impact up to n elements in the array (and 2 [?] 3 [?] elements in the hashmap), such as with:

emptyDict
  |> insert "x" Before
  |> insert "y" Before
  |> insert "z" Before
  |> updateAt 1 (\_ -> {k: "x", v: After})

y would disappear, and x would be moved to the middle element (which would cause the first element, the former location of x, to disappear?), and then, without tombstones, z and any elements after it, would need to shift to fill the newly formed gap.

Brendan Hansknecht (Mar 17 2022 at 05:52):

That or swap the last element into the hole since that is the default remove functionality?

Kevin Gillette (Mar 17 2022 at 05:53):

For an LFU cache, I'd just rely on single swaps, typically adjacent: as soon as you get a cache hit, you increment the count for that key, and if the increment puts it higher than its left-side (lower index) neighbor, swap the two positions. In the pathological case where there's a run in k keys that all have the same count and you increment the last one in the run, then just swap the first and last elements in that run (the relative ordering among elements with the same count shouldn't matter if they have the same count). Scanning to find the swap partner might thus take arbitrary time without some auxiliary data of some kind, but the actual swap would always be constant time.

Kevin Gillette (Mar 17 2022 at 05:55):

In the general case, that could be fine, or not, depending on whether, and how, the application depends on ordering. For applications which don't care about Dict ordering, then any swap to fill a hole would be fine, assuming the swap partner is deterministically selected.

Giovanni Zanandrea (Mar 21 2022 at 12:16):

I wonder if it wouldn't be better to have a builtin dictionary that doesn't expose the ordering of the elements at all, and then build something similar to Rust's IndexMap on top of it in a library

The builtin version wouldn't have a toList method (or any other methods that exposes an internal ordering), but it could still have all order-independent methods like toSortedList, map, lookup, update and so on. And it wouldn't have the above mentioned problems with equality.

From this discussion it sounds like the version of the dictionary that remembers the insertion order would have to store the order in a separate array anyway, so implementing it on top of the order-independent builtin version should not create performance problems. And since it looks like there's more than one useful way to manage the order (removeSwap, removeShift, ...), maybe we could also have different versions of the ordered dictionary for different use cases.

Anton (Mar 21 2022 at 12:30):

Using toList on a dict comes up fairly often in data science so I imagine it would be missed.
In data science you may also have very large dicts, so you don't want to waste time sorting things when it's not needed.

Brendan Hansknecht (Mar 21 2022 at 14:51):

Brendan Hansknecht (Mar 21 2022 at 14:52):

Giovanni Zanandrea (Mar 22 2022 at 07:11):

Sorry, I don't think I explained myself clearly here. I didn't mean to suggest that we should give up the ability to iterate through the elements of a dictionary without first doing some expensive sorting. The point I was trying to make is a different one. So please allow me to elaborate a bit. As an example I'll use a set rather than a dictionary: the issues are the same, but using sets simplifies the discussion a bit.

A set is, by definition, a collection that contains no duplicates and whose elements are not ordered. A "set" that remembers the order in which its elements were inserted is not a set anymore, it's something else. It's a list/queue without duplicates that also provides the ability to quickly locate the position of a given element. I personally feel very uncomfortable about having a set or dictionary type that implements methods like the above mentioned updateAt.

    template <typename T> class not_a_set {
      // Main data structure, the one that contains all the elements in their proper order
      vector<T> elements;

      // Auxiliary data structure, maps a hashcode to the
      // indexes of all elements that have that particular hashcode
      unordered_map<u32, set<u32>> hashcode_to_indexes;

    public:
      bool operator == (const not_a_set<T> &other) const {
        // For a comparison we can ignore hashcode_to_indexes,
        // elements contains all the information we need
        return elements == other.elements;
      }

      bool contains(const T &element) {
        return find(element, calculate_hashcode(element)) != 0xFFFFFFFF;
      }

      vector<T> &to_list() {
        return elements;
      }

      void insert(const T &element) {
        u32 hashcode = calculate_hashcode(element);
        u32 idx = find(element, hashcode);
        if (idx == 0xFFFFFFFF) {
          elements.push_back(element);
          hashcode_to_indexes[hashcode].insert(elements.size() - 1);
        }
        else {
          ...
        }
      }

      void remove_swap(const T &element) {
        ...
      }


    private:
      u32 find(const T &element, u32 hashcode) {
        set<u32> &indexes = hashcode_to_indexes[hashcode];
        for (auto it=indexes.begin() ; it != indexes.end() ; it++) {
          u32 idx = *it;
          if (element == elements[idx])
            return idx;
        }
        return 0xFFFFFFFF;
      }
    };

The main data structure here, the one that contains all the elements, is a simple list/vector/array. The other member variable (hashcode_to_indexes) is only an auxiliary data structure used to quickly locates the index of a given element.

Node that here I never need the ability to iterate through hashcode_to_indexes, because if I need to iterate through the collection, I'll just iterate through the elements vector.

Also, this is what I would do if all I had to implement was remove_swap(...). If I had to (efficiently) implement remove_shift(...) I would use a completely different (and more expensive) data structure.

So if I had to design a collection library I would start by implementing unordered collections like HashSet and HashDict, probably as builtin types, for efficiency. I wouldn't want their elements to have any externally observable order because the order of the elements is extra information we would have to keep track of, and that's bound to constraint the implementation and make it less efficient. And since there's no order, I would avoid any method (like for example toList) that exposes the internal ordering of the elements, because that would introduce nondeterminism in the language.

Then I would use those builtin types as building blocks to implement more capable library types like NotASetWithRemoveSwap, NotASetWithRemoveShift, NotADictWithRemoveSwap, NotADictWithRemoveShift and so on (I would choose nicer names, of course).

Then in my application code I would use HashSet and HashDict only when performance is super important and I don't need to iterate through the entire collection, and the derived library types in all the other cases.

Richard Feldman (Mar 22 2022 at 11:08):

that seems like a reasonable (and clever!) design to me except that one of the original goals was to come up with a "default dictionary" (and Set) that almost everyone in the ecosystem feels comfortable reaching for unless they have very unusual requirements

Richard Feldman (Mar 22 2022 at 11:10):

I'd expect iterating over the dictionary to be a common requirement, so if we adopted this design, I think we'd want to have both data structures included in the builtins

Richard Feldman (Mar 22 2022 at 11:11):

at which point I think the guidance would be "always use the ordered one because you can iterate over it"

Richard Feldman (Mar 22 2022 at 11:11):

Richard Feldman (Mar 22 2022 at 11:13):

I understand your point about "if it's ordered, it no longer feels like a Set" - but if the advice is "when you want a Set, don't reach for the thing named Set because you can't iterate over that one" seems to be unavoidably at odds with the design goal of having an obvious default data structure to reach for! :sweat_smile:

Brendan Hansknecht (Mar 22 2022 at 15:29):

Firstly, I want to make a clear distinction between mathematics and implementation. For example, any programming implementation of a set contains ordered elements. They are all implemented on top of some form of list. That is part of the linear memory space we on constrained to. With that in mind, I would say that the simplest implementation of a set, is just a list. As long as you scan the entire thing when you insert, you can make sure it will never have duplicates. That alone is enough to build the data structure we have describe above. It would be slow, but it would a minimal functional implementation that has no dependency on any underlying dictionary/set implementation.

Why does that matter? Once mathematical definition and implementation details are separated, an important property opens up. We get to decide how much of them implementations is useful beyond the core set functionality. Given we are working with constrained hardware, taking advantage of implementation knowledge is very important for performance. The reason for things like updateAt is not because we think that it is a reasonable method on mathematical sets. updateAt is part of our specific implementation that trivially enables multiple use cases, like an LRU or LFU cache. Without exposing this, those data structures would get built extremely differently in user space. Most likely in a more complex and slower form.

All of this said, I would also like to note, that I personally think that we should just say that referential transparency is only per individual complilation, expose Num.appSeed, and make our default dictionary an unordered form with iteration. At least currently, that is not the plan for Roc.

Giovanni Zanandrea (Mar 23 2022 at 09:45):

OK, gotcha, but I still feel that either the mathematical or the computer science community should come up with a short and catchy name for the kind of data structure we're talking about. We have the term set, for collections that have no order and no duplicates; bag/multiset for those that can have duplicates, but no order; sequence/list for those that can have duplicates and whose elements are arranged in a linear order; but nothing for those with linear order but no duplicates. A glaring omission that really bothers me.

About the implementation vs mathematical definition: I think there are very strong arguments for sticking to the mathematical definition whenever possible. The mathematical structures have properties that are usually lost when you deviate from them. These properties are for example used by query engines for declarative languages like SQL or Datalog to rewrite queries and find more efficient ways to execute them. I'm no expert on that, but I've read that even the fact that SQL relations can contain duplicates makes optimization more complicated. Another benefit of not committing to any particular implementation is that, yes, you'll loose the extra perks that specific implementation can provide, but on the other hand the compiler/runtime/DBMS will be free to pick whatever implementation is best suited for any given task. And that's exactly what I do with my own language, which makes heavy use of sets and relation. The compiler analyzes the entire code base to see how each individual relation is used, and then picks the most efficient implementation that provides the required functionality. Without that, there's no way I could be competitive in terms of speed with, say, Java. And when it's not possible (or just too hard) to pick the best implementation at compile time, it's done under the hood (with much less efficiency, unfortunately) at runtime (More on that below.)

I'm sympathetic to the idea of restricting referential transparency to individual compilations, and that's to some extent what I do in my own language, since I see no way there to provide deterministic iteration over unsorted collections without totally killing performance. And in my case the situation is much worse that it would ever be in ROC, not only because sets and relations are the primary way to encode data, but also because mine is an embeddable language: the compiler generates a sets of classes in Java, C# and (soon) C++, that are meant to be included in an existing codebase, and each of those code generators use a totally different notion of internal ordering. In spite of that, I've never had in practice to fix a bug caused by nondeterminism, and I think that's even less likely to happen in ROC (it might be more of a problem in imperative languages, I guess). It's still annoying though that the same program can produce different (but equally valid) outputs when compiled by different versions of the compiler. I'm currently working on the C++ code generator, I'd often like to test it against the Java one by comparing the outputs of the same program, but in many cases I can't, because those outputs are different.

On a separate, but related, subject: is the idea of adding tree-based sets and dictionary to ROC being considered? They can provide functionalities that are not available in their hash-based counterpart (min, max and i-th elements, range queries, ...) plus they have a meaningful and efficient iteration order, and I think they can greatly benefits from being implemented as builtins. Let me explain why.

In my own language sets and maps are a builtin type, just like, for example, lists, and their implementation is an implementation detail that doesn't affect the semantics of the language. When a set/map is created in a single operation, for example using map/filter or a literal, the language stores it using a sorted array, which is the most memory-efficient representation. But if afterwards an individual element/entry is added to, or removed from, the set/map, the runtime transparently switches (in O(1)) to a hybrid representation that combines binary trees and sorted arrays (basically, a binary tree whose leaves contain sorted arrays, not just single elements) which can support O(log(N)) insertions and deletions. If this binary tree later needs to be copied (something that could be pretty common in ROC, as far as I understand), then the copy is arranged in the more efficient sorted array representation.

Some of this can be implemented in user space, but not all of it. For example having to iterate through a very deep binary tree is a pretty good opportunity to very cheaply rearrange its content in the more efficient sorted array representation, but since iteration is not supposed to change the collection this has to be done under the hood.

I think this applies (to a lesser extent) to the NotASet/NotADict types as well. For example, one reasonable way to implement removeShift (which, I believe, is a much more logically sound default than removeSwap) is to leave a hole in the array and postpone the (expensive) shifting operation until it's actually needed. And an iteration over the collection is an excellent opportunity to do that at a minimal cost, but again, that has to be done under the hood. Same goes for having to make a copy of the collection.

One more silly though: if the difference in performance between HashSet/HashDict and NotASet/NotADict turns out to be significant, wouldn't it be possible for the compiler to automatically replace the latter with the former whenever it can detect that the ordering is not used? I realized that's difficult to do in a functional language, but at least in some simple cases that might be within the ability of a moderately sophisticated analyzer. It may turn out to be not much harder than detecting at compile time when a data structure has a single reference and can be safely updated in place, which I seem to understand you guys are already doing?

Richard Feldman (Mar 23 2022 at 12:59):

I think the question is whether the benefits would actually be worth it in practice. :big_smile:

Kevin Gillette (Mar 23 2022 at 13:17):

I'm not able to find a mention of representational ordering, or lack thereof, at least on the [Wikipedia page] (https://en.m.wikipedia.org/wiki/Finite_set), though I'm not surprised because abstract mathematics doesn't care about performance (statements don't need to be computable in finite time or finite memory), and so there's no difference between "iterate over this set which contains some integers..." and "enumerate over every real number, and for those which are present in this set..."
The page does mention a lot about order, but not _representational_ order (irrelevant at the abstract mathematics level), but rather as a way to define the finiteness of finite sets, and as a way to relate sets themselves relative to each other (defining subset and so on).

I'm not so sure that it's actually mandated by computer science definitions that there be no defined order (for either a set nor a mapping). Languages often do not define order so that they can change the implementation without breaking expectations, and also so that they can use less memory or utilize simpler internal structures. If ordering had no implication on internal complexity or overhead, then I suspect we'd see defined order as the default.

In a referentially transparent language, like Roc, it's critical for the same operations to produce the same answer, which implies that data structure ordering must at least be deterministic based on creation: at least within a single program run, constructing a set with b, then a, must iterate in the same order as any other set created with b, then a. It has been called out that, if we can provide consistency across program runs, and across binaries compiled within the same major version of Roc, then that'd be a lot less confusing.

We certainly can give special names to sets and types which have a defined order (in Rust, the nearest equivalent being discussed is IndexMap). But is the value in that to distinguish between multiple types within the same language, or to just clarify the semantics of the default type? My sense is that it's the latter.

I've worked on a 3rd party set implementation for Go, which happens to be based on sorted, deduplicated arrays
and merge algorithms (and it happens to be much faster to do intersection and such that way than is available via Go's builtin maps): I never thought of calling this anything other than just "Set". Usually most people wanting to print out a set would prefer to do it in sorted order (when the elements are sortable), so it saves them effort if they don't need to convert to a list and then sort.

Brendan Hansknecht (Mar 23 2022 at 17:36):

Personally I see that as the job of the human, but I tend to work on lowish level stuff. So very clear bias. I also am a strong believer in Hyrum's Law, so I think it would be exceedingly hard to do this on software at scale. Minute differences will lead to breaking changes. That or hiding the differences will lead to ~~unacceptable~~ performance losses.

I don't understand this. Java picks a specific implementation. You could just pick the exact same implementation and be competitive with Java. I also would bet that there are some implementations that are all around better than the Java implementation.

I find it really interesting how Roc basically has the same constraint. It is just always embedded using object files and C abi.

Brendan Hansknecht (Mar 23 2022 at 17:39):

Can't you do unordered comparisons of the output, or is it more complex then that?

Brendan Hansknecht (Mar 23 2022 at 17:41):

I don't think their iteration is efficient (lots of pointer chasing), but otherwise, totally agree. Maybe we need to reconsider having an ordered and unordered dictionary type base on thing like (min, max and i-th elements, range queries, ...).

That being said, I think it would be easier to build an efficient tree based set in roc than an efficient hash based set in roc.

Brendan Hansknecht (Mar 23 2022 at 17:44):

Aren't these contradictory? If there are observable difference, it must effect the semantics?

Brendan Hansknecht (Mar 23 2022 at 17:45):

Brendan Hansknecht (Mar 23 2022 at 17:46):

Yeah, should be doable in userspace, but would be really odd: Why does iterating over my set return a new set?

Brendan Hansknecht (Mar 23 2022 at 17:49):

This would be really expensive in a common dictionary use case of simply adding, removing, and looking up many items. There would never be a good time to run the remove shift. The underlying memory allocation would just keep growing due to all of the empty slots in the underlying list. It is not possible to fill in the empty slots without making insertion an O(n) scan, and that would break our insertion ordering anyway.

Giovanni Zanandrea (Mar 24 2022 at 14:37):

As I see it, that's a somewhat watered-down version of referential transparency. Yes, program runs are guaranteed to be deterministic, but individual functions are not:

  s1 = emptySet |> insert 1 |> insert 2 |> insert 3
  s2 = emptySet |> insert 2 |> insert 3 |> insert 1
  r1 = f s1
  r2 = f s2

s1 and s2 are identical for any meaningful definition of set, but r1 and r2 need not be. Note that the problem completely disappears when using binary trees or ordered arrays. Don't get me wrong, I totally agree that even this looser definition of referential transparency is extremely useful in practice, and I'll be very happy to take it, but I still like the real version better

Giovanni Zanandrea (Mar 24 2022 at 14:44):

No. There are observable differences between the different versions of the compiler, but for any given version you cannot tell how a collection is implemented.

It doesn't. That's exactly my point. It just rearranges the content of the existing set under the hood. That's why that cannot be implemented in userspace in a pure functional language. I think this can be explained best with a few lines of code:

      // Creating a new sequence/list. The list contains a mix of different data types
      // (strings, integers, floating point numbers, dates, times and symbols) so there
      // can be no meaningful notion of order between them
      seq = ("Hi!", 42, `2022-03-24`, 3.14159, :a_symbol, `2022-03-24 10:20:53`, ...);

      // Turning that sequence into a set, using a comprehension expression. The set
      // is created with a single operation, so it is represented as a sorted array
      set_1 = [x : x <- seq];

      // Turning again the sequence into a set, but starting with the empty set and
      // inserting elements into it one at a time. Here the compiler is forced to use
      // a less efficient tree-based representation
      set_2 = [];
      for x <- seq
        set_2 = insert(set_2, x);

      // In spite of the fact that set_1 and set_2 have different
      // internal representations they still compare equal
      assert set_1 == set_2;

      // Iterating through the first set, and storing its element in a sequence. This
      // is the kind of operation that exposes an implementation detail that ideally
      // should be kept hidden:
      seq_1 = ();
      for x <- set_1
        seq_1 = (seq_1 | x);

      // Doing the same with the other set
      seq_2 = ();
      for x <- set_2
        seq_2 = (seq_2 | x);

      // Despite the fact that set_1 and set_2 have different representations, their
      // iteration order is the same and the resulting sequences compare equal
      assert seq_1 == seq_2;

      // Now set_2 is not stored as a binary tree anymore. It's now using a sorted
      // array. During the previous iteration iteration its elements were internally
      // rearranged as a sorted array. Since we're already retrieving all elements
      // of the set, the rearrangement can be done with minimal overhead, but it
      // has to be done under the hood. If there had been other references to set_2
      // throughout the program, they would all now see the sorted array

      // Inserting a new element into set_1 (which uses a sorted array).
      // The result, set_3, is represented as a single node tree. The root
      // stores the new element (2.71828), the left branch points to the section
      // of set_1's array that contains the elements that come before the new
      // one, and the right branch points the other section of set_3's array
      // The whole insertion takes O(log(N))
      set_3 = insert(set_2, 2.71828);

      // These two print instructions now produce the exact same output, but
      // this output will be different with each version of the compiler
      print seq_1;
      print seq_2;

Another example of separation between logical definition and physical implementation is records. Records in my language are just dictionaries (which are in turn just a special case of binary relations) that map symbols to arbitrary values (that has several advantages that are explained here). But of course they need an efficient representation. So if you define in your code a type like this one:

      type Rect = rect(left: Int, bottom: Int, width: Nat, height: Nat);

      typedef struct {
        int64 left_L;
        int64 width_L;
        int64 bottom_L;
        int64 height_L;
      } OBJ_Rect;

and all values that belong to the type Rect will be stored using that optimized data structure. You can create them directly:

      rect = rect(left: 4, bottom: 1, width: 5, height: 3);

      tag = :rect;
      keys = (:left, :bottom, :width, :height);
      values = (4, 1, 5, 3);
      rect = tag([k -> values(i) : k @ i <- keys]);

      left = 4; bottom = 1; width = 5; height = 3;
      str = "rect(left: " & print(left) & ", bottom: " & print(bottom) & ", width: " &
                print(width) & ", height: " & print(height) & ")";
      res = parse(str);
      rect = if succeeded(res) then result(res) else undefined;

but you'll end up with the same physical implementation (obviously, only the first way of creating it is efficient). If you hadn't defined the Rect type, the above snippets of code would still be valid, but the result would be stored as a generic (tagged) dictionary.

      // The following snippet of code prints `42` when compiled with
      // the C++ code generator, and `2022-03-24` with the Java one
      set = ["Hi!", 42, `2022-03-24`, 3.14159, `2022-03-24 10:20:53`];
      for x <- set {
        print x;
        break;
      }

The C++ version at least guarantees the natural iteration order for things like numbers, dates and times without affecting performance, but when generating Java code I haven't even been able to do that.

I used the expression "tree-based" for simplicity, but what I really meant was an implementation (or a combination of implementations) that rely on ordering (as opposed to hashcodes) to organize the elements of the set/dictionary. Once you start using sorted arrays under the hood as much as possible, in the average case everything becomes a lot faster. And there are other implementations as well. Kevin Gillette in the previous post mentioned one that I'm not familiar with and which I'm actually curious about. And then there are implementations that only work with integer keys, but which are supposed to be very fast, like Judy arrays.

Giovanni Zanandrea (Mar 24 2022 at 14:48):

      type EmployeeId = employee_id(Nat);
      type CompanyId  = company_id(Nat);

      schema Workforce {
        employee(EmployeeId)
          first_name:     String,
          last_name:      String,
          date_of_birth:  Date,
          soc_sec_no:     SocSecNum [unique];

        company(CompanyId)
          name: String;

        works_for(EmployeeId, CompanyId)
          salary:     Money,
          start_date: Date;

        works_for(e, c) -> employee(e), company(c);
      }

The above snippet of code defines what I call a relational automaton. Think of it as a sort of class that contains an entire mini-database inside. It works a bit like the Elm architecture. It can receive messages, which are processed by an update function. Automata don't share state and everything is quasi-deterministic. The major differences with TEA are that the state of the automaton is encoded using relations (as opposed to functional data structures), and that the update function instead of returning the new state returns a set of insert/update/delete and other atomic instructions that are then applied to the existing state to produce the new one (details here)

Now, that's hidden behind a lot of syntactic sugar, but the state of the above automaton is encoded using two unary relations/sets (employee and company), six binary relations (first_name, last_name, date_of_birth, soc_sec_no, name, works_for) and two ternary relations (salary and start_date), plus a set of keys and foreign keys.

Now, the relational model is IMHO by far the best way we have to encode certain types of information, but if implemented naively it's also pretty slow. So here the compiler needs to analyse how each relation is used, and pick the most efficient representation that can do the job. Let's take for example works_for. Workforce is likely to have methods that look like this:

      // Given the id of an employee, returns the set of companies it works
      // for (an employee can have more than one job)
      [CompanyId]  employers(EmployeeId e) = [c : c <- works_for(e, ?)];

      // Given the id of a company, return the set of employees
      [EmployeeId] employees(CompanyId c)  = [e : e <- works_for(?, c)];

Once the compiler see the code above, it knows that works_for has to be navigated in both direction, so it needs a more capable and expensive data structure to implement it. Same goes for soc_sec_no, which needs to guarantee the unicity of SSNs. But all the others relations, assuming that you don't actually search on any of them, are implemented using data structures that are nearly identical to the ones used in the Entity-Component-System architecture, which is used in videogames exactly because it's very fast. That's what I mean when I say that I couldn't be competive with a language like Java if the compiler wasn't free to pick the best implementation for each individual case.

We'll have to agree to disagree on this one. I believe that the job of a high level language is to work out the implementation details for you as much as possible. Even when the compiler can't do that in an optimal way because it lacks information a human would have, it can be nudged in the right direction using declarative annotations, for example. Obviously, that approach won't alway work. But isn't that true of all high-level languages? ROC itself is probably not going to completely eliminate the need for languages like C/C++/Rust/Zig, but does that mean it's not useful? I think not, and I'm actually very excited about it, so much so that I hope to be able to contribute to it at some point in the future.

Brendan Hansknecht (Mar 24 2022 at 17:55):

So if someone switches from the Java version to the C++ version, they might not be able to directly port their Java code to C++ because they need to deal with differences between the Java and C++ version of the compiler?

Sorry, with the re-arranging in user space, I meant that you could do it explicitly. I totally understand that you lose the implicit/under the hood transformations. So I don't think an explicit user space version would be at all reasonable. Also, the under the hood re-arranging (assuming no accidentally noticable side effects) is a really cool idea.

Your record example is very intriguing. How do you actually decide when to swap implementation and not use a struct?

Complete aside, I was an SWE intern at a hospital company and this was not true in their data....which was concerning to say the least. They probably should be using the contradictions to detect fraud.

Thanks for the clarification. I guess the core information here is that your language is far from implementation. So if someone were to make the same functionality in raw Java, they might go about it in a totally different way. So you language has to consider a lot around lowering to get competitive speeds.

Yeah, as I said, I tend to work pretty low level, so I have a lot of bias. I just generally have found being lower and having more control was worth it for the performance. I think a lot of this discussion depends on how high level Roc actually is. I want to use Roc as a low level langauge. I want to be able to use it were I would currently reach for Rust or C++. With that in mind, I don't want to heavily depend on compiler heuristics for performance or pay the cost of "under the hood optimizations" that I don't actually need. So I understand both sides, but my want for "high level" is much much lower than your form of "high level".

On the other hand, with something high level like python or javascript or SQL, I would gladly take your optimizations as long as they aren't too slow to compile/interpret.

Roc depends on C/C++/Rust/Zig. You can't compile a Roc app without a platform, and a platform has to be written in a language like those. As I allude to earlier, my hope would just be that roc is essentially as fast and only slightly higher level than those languages, but while being a pure functional programming language. That will be had to do at scale, but I think the project will manage this for the most part.

Kevin Gillette (Mar 26 2022 at 04:11):

Hah, I wish. I haven't found anything in the accessible literature. I've pretty much using what is described in https://en.wikipedia.org/wiki/Merge_algorithm (iow, "MergeSort"), except that the inputs are sets (lists) that are _already_ sorted (and are deduplicated), and the merge details vary based on the set operation, for example set intersection skips elements from either list that are not present in both lists; and none of the operations produce duplicates as a merge sort would. Also, there are opportunities to use binary search for long non-overlapping sequences, and some of the techniques found in Timsort.

The main downside seems to be that single-element membership tests are logarithmic time rather than amortized constant time as you'd get with a hashmap, though multiple-element membership tests ("are all present" via subset checking, and "are any present" via intersection checking) does pretty well, since some elements can be easily excluded by comparing min/max set values, and the inherent sorted order can provide some optimizations that just aren't available to a hashmap.

iiuc, with sets built on pure hashmaps using hash functions with good distribution, the "amortized constant time" factor can add up to a significant/notable amount of overhead when performing pair-of-sets operations (like union, intersection, etc): if the two sets have identical hashing functions, then at least the hash cost would not need to be paid more than once per element; if the two sets also have an identical number of buckets (same approximate size/capacity), then hashing doesn't need to be performed, since buckets can just be walked pair-wise; but in any case, assuming each bucket has a list of 8-16 ("k") elements that are in arbitrary order, that means a k-element walk in set b for every element in set a for an operation (if searching for a non-match, such as set difference, otherwise average k/2 if searching for a match that happens to exist, such as in set intersection). If a hash is needed, and the elements are long strings (assuming there's no auto-interning of strings), then hash can be arbitrarily expensive: sorted-list sets don't have this issue, nor the k-element sub-walk overhead.

Kevin Gillette (Mar 26 2022 at 04:56):

I generally agree with this. C++ is a language that has ~80 keywords, many/most of which are defunct or have been repurposed because C++' goal was to provide performance _now_ even when we could be confident that smarter compilers would obsolete the value of that low level control later. I'm not convinced C++' approach is in any way sound for a modern language: if we believe something _can_ be optimized later, maybe we can provide an escape hatch of some kind, but the core language or libraries should probably not express distinctions which are relevant to performance unless such details also better clarify intent ("I'm using this kind of mapping because it has these powerful _extra_ capabilities I need," rather than "I'm using this kind of mapping because, while it has an identical API to all the others, it has better performance in my specific use case").

Having one (or very few) standard contract for a hashmap, list, etc, does mean that it can more flexibly be optimized in various and transparent ways, such as @Giovanni Zanandrea is describing. Also, having a single simple implementation is also a form of optimization considering it has great instruction cache utilization compared to many compiled datastructure variants. The point is that, at least for the kind of work I do (I'm certain it varies by person and hence is a great discussion), I need _pretty good_ performance and _pretty reasonable_ overhead, but once that's obtained, prefer simplicity above optimal performance or other concerns.

Perhaps there's some middle ground where the language provides a default Dict type with a variety of capabilities, which compiles into one or another Dict implementation residing within IHopeYouKnowWhatYourDoing.Dict submodules depending on which capabilities are being used (if you never iterate over the Dict, then a lower-overhead order-undefined classic hashmap can be selected, for example). These specialized Dict implementations could be used directly by those who need direct control.

However, in my experience, these specialized implementations should probably not be named WyHash or FnvHash or whatever, but rather named after their general tradeoffs, and only document that "right now this uses Fnv." Remember that md5 was considered pretty good and general purpose at one point (albeit an entirely different kind of hashing function). More to the point, various bespoke or temporarily well-regarded dictionary hashing algorithms, that turn out to have distribution problems or pathological cases, have also come and gone over the years; or if they were actually pretty good, then regularly enough something that is strictly better at achieving the same tradeoffs comes along (same memory but faster on all the same inputs, or same speed but lower memory, or same speed and memory but better distribution).

Naming implementations after their low level implementation details (rather than high-level capabilities or tradeoffs) may lead to a situation in which we really regret adding a type to the stdlib, and really want people to stop using it, and at that point we're stuck if we care about retaining compatibility for existing code.

Brendan Hansknecht (Mar 26 2022 at 05:14):

I totally agree. I use C++ because I have to and it matters for a lot of my work (rust would also be mostly valid). I personally would love something akin to zig++ where it is C++ but with a clean start that is willing to delete and simplify. Zig is great, but I am not a big fan of C, so I don't really want the best version of C.

Though compilers are getting better and smarter, there are many cases where either it costs to much to run an optimization (try n types of HashMap and pick the best) or the optimizations are outside the scope of the compiler (you wrote your logic like x, let me fundamentally understand that and write it with a new algorithm).

Due to these kinds of limitations, there will always be a reason to care about the low level details no matter how good the compiler gets. Using extra cores or memory when scaled to data centers has huge monetary and environmental costs. So saving 0.1% of cpu usage might actually really matter for some operations.

Brendan Hansknecht (Mar 26 2022 at 05:18):

I think for probably 95+% of code, this is the case. I think it is always good to keep that in mind especially since Roc is linked with a platform and the platform can do low level things if need (though it has more overhead than calling a builtin).
Also, in general, I am worried that analyzing every list or map or etc to potentially swap out the implementation will have very high compile time costs that we likely don't want.

Brendan Hansknecht (Mar 26 2022 at 05:19):

100% agree. Unless we want to keep it around forever we need to be careful of these things. Especially since we might even want to change out the hash depending on the type.

Brendan Hansknecht (Mar 26 2022 at 05:20):

Brendan Hansknecht (Mar 26 2022 at 05:21):

So we should make sure that whatever we do, we are committing to behaviour, not implementation.

Giovanni Zanandrea (Mar 26 2022 at 11:21):

But at that point, wouldn't it be preferrable to specify the hashing algorithm as a compiler option? We could have a configuration file where we declaratively control all the things that can introduce nondeterminism in the language. Not just hashing algorithms, but also things like Num.appSeed, which Brendan Hansknecht mentioned in another post. The developer would still have a pretty good degree of control over these things, and we wouldn't have to worry about backward compatibility because newer versions of the compiler could keep supporting the old configuration options while adding new ones.

Giovanni Zanandrea (Mar 26 2022 at 11:22):

Correct. I think it would be trivial to maintain full compatibility between the C/C++ code generator and a (hypothetical) one that generates Zip or (unsafe) Rust code without affecting performance in any way, because all those languages give you full control over what the machine is doing. But compiling to a completely alien data model like Java's creates a lot of additional difficulties, and something has to give.

But as I said in a previous post, I've never had a bug that showed up in one of the implementations but not the others (and the compiler is self-hosted, so I've written plenty of code in my own language). The real issue is divergent but still valid outputs, which is a testability nightmare.

Giovanni Zanandrea (Mar 26 2022 at 11:23):

Right now I just optimize all tagged records that appear in a type definition. The type system I use can't do type inference anyway, and beside I like to explicitly define my data structures, so that is good enough for me. But you could also use other criteria as well.

Giovanni Zanandrea (Mar 26 2022 at 11:24):

I'd like to point out that even when compiling to Java (and as I said, Java is a dreadful compilation target in a case like mine) my language is already competitive with OO Java and often faster than OO C# for data/state-intensive application (some benchmarks here). With a few additional optimization, which are not too hard to implement, I think the C++ code generator will be able to clearly outperform Java in that domain. So it's nowhere in the same league as python or ruby, let alone SQL, when it comes to performance.

Obviously, there are algorithms that are hard to implement efficiently in a referentially transparent language (even one like mine that allows some local mutability), so Java will have an edge there. But there might be a way to combine the relational model with a ROC-like language that does opportunistic mutation, and solve even that problem.

And the relational model offers opportunities for optimizations that I haven't even began to take advantage of. They would require the addition of a declarative Datalog-like language, which at the moment I don't have. But with that in place, I can think of use cases where it would be possible to provide a level of performance that would be very difficult in practice to achieve even in C. Not impossible of course, since C is basically glorified machine language. But it would take an awful lot of work to replicate what in a declarative language could be done automatically by the compiler.

Compilation speed is indeed a bit of an issue right now, but the main bottleneck is the structural type system. Optimization of data structures is actually pretty fast.

Giovanni Zanandrea (Mar 26 2022 at 11:25):

If this "optimization by opportunistic mutation" thing pans out I suspect it could be a real game changer for functional programming. I really hope it works.

Brendan Hansknecht (Mar 26 2022 at 14:38):

Sorry if my comment came off as attacking the speed of your language. I was trying to say that the higher level a language is the more ok I am with it swapping things out under the hood. As you move up the stack you are giving up control for convenience. Your language looks super convenient for it's use case, but the end user has very little control over final assembly executed.

Giovanni Zanandrea (Mar 27 2022 at 08:01):

No worries, your comment did not come off as an attack and I didn't take it as such. I just wanted to point out that high-levelness and good performance are perfectly compatible

Stream: ideas

Topic: Hash Maps & Sets

Derek Gustafson (Mar 07 2022 at 20:52):

Brendan Hansknecht (Mar 07 2022 at 20:52):

Brendan Hansknecht (Mar 07 2022 at 20:53):

Brendan Hansknecht (Mar 07 2022 at 20:54):

Derek Gustafson (Mar 07 2022 at 20:54):

Brendan Hansknecht (Mar 07 2022 at 20:55):

Derek Gustafson (Mar 07 2022 at 20:55):

Derek Gustafson (Mar 07 2022 at 20:56):

Brendan Hansknecht (Mar 07 2022 at 20:58):

Derek Gustafson (Mar 07 2022 at 21:00):

Brendan Hansknecht (Mar 07 2022 at 21:01):

Ayaz Hafiz (Mar 07 2022 at 21:02):

Derek Gustafson (Mar 07 2022 at 21:02):

Brendan Hansknecht (Mar 07 2022 at 21:02):

Derek Gustafson (Mar 07 2022 at 21:03):

Brendan Hansknecht (Mar 07 2022 at 21:05):

Derek Gustafson (Mar 07 2022 at 21:06):

Brendan Hansknecht (Mar 07 2022 at 21:07):

Brendan Hansknecht (Mar 07 2022 at 21:08):

Derek Gustafson (Mar 07 2022 at 21:09):

Brendan Hansknecht (Mar 07 2022 at 21:09):

Ayaz Hafiz (Mar 07 2022 at 21:09):

Richard Feldman (Mar 07 2022 at 21:10):

Richard Feldman (Mar 07 2022 at 21:10):

Derek Gustafson (Mar 07 2022 at 21:11):

Derek Gustafson (Mar 07 2022 at 21:12):

Derek Gustafson (Mar 07 2022 at 21:13):

Brendan Hansknecht (Mar 07 2022 at 21:18):

Derek Gustafson (Mar 07 2022 at 21:22):

Ayaz Hafiz (Mar 07 2022 at 21:23):

Brendan Hansknecht (Mar 07 2022 at 21:24):

Brendan Hansknecht (Mar 07 2022 at 21:25):

Richard Feldman (Mar 07 2022 at 21:37):

Richard Feldman (Mar 07 2022 at 21:37):

Richard Feldman (Mar 07 2022 at 21:37):

Richard Feldman (Mar 07 2022 at 21:38):

Richard Feldman (Mar 07 2022 at 21:38):

Folkert de Vries (Mar 07 2022 at 21:43):

Folkert de Vries (Mar 07 2022 at 21:43):

Richard Feldman (Mar 07 2022 at 21:46):

Richard Feldman (Mar 07 2022 at 21:46):

Richard Feldman (Mar 07 2022 at 21:57):

Richard Feldman (Mar 07 2022 at 21:58):

Richard Feldman (Mar 07 2022 at 21:58):

Richard Feldman (Mar 07 2022 at 21:59):

Richard Feldman (Mar 07 2022 at 21:59):

Richard Feldman (Mar 07 2022 at 21:59):

Richard Feldman (Mar 07 2022 at 21:59):

Richard Feldman (Mar 07 2022 at 22:00):

Ayaz Hafiz (Mar 07 2022 at 22:02):

Richard Feldman (Mar 07 2022 at 22:21):

Ayaz Hafiz (Mar 07 2022 at 22:26):

Folkert de Vries (Mar 07 2022 at 22:39):

Folkert de Vries (Mar 07 2022 at 22:40):

Folkert de Vries (Mar 07 2022 at 22:40):

Brendan Hansknecht (Mar 07 2022 at 23:02):

Brendan Hansknecht (Mar 07 2022 at 23:02):

Brendan Hansknecht (Mar 07 2022 at 23:03):

Ayaz Hafiz (Mar 07 2022 at 23:08):

Brendan Hansknecht (Mar 07 2022 at 23:10):

Richard Feldman (Mar 07 2022 at 23:18):

Richard Feldman (Mar 07 2022 at 23:18):

Richard Feldman (Mar 07 2022 at 23:19):

Richard Feldman (Mar 07 2022 at 23:19):

Richard Feldman (Mar 07 2022 at 23:20):

Richard Feldman (Mar 07 2022 at 23:20):

Brendan Hansknecht (Mar 08 2022 at 00:11):

Brendan Hansknecht (Mar 08 2022 at 00:17):

Jared Cone (Mar 08 2022 at 00:32):

Richard Feldman (Mar 08 2022 at 01:31):

Richard Feldman (Mar 08 2022 at 01:31):

Richard Feldman (Mar 08 2022 at 01:31):

Richard Feldman (Mar 08 2022 at 01:32):

Richard Feldman (Mar 08 2022 at 01:58):

Richard Feldman (Mar 08 2022 at 01:58):

Brendan Hansknecht (Mar 08 2022 at 02:46):

Richard Feldman (Mar 08 2022 at 03:16):

Richard Feldman (Mar 08 2022 at 03:16):