I've been working on efficiently calculating attributes for idents, including legal attributes like ! suffix, _ suffix, etc. as well as problems like double underscores. I think this information is useful throughout the rest of the compiler, so it feels appropriate to store these in a byte as bitflags and pass them around the parsing code that way. It would require slightly more space for every string we keep in the AST, but it would ensure we don't have to repeat parsing work for every single ident in the compiler.
Does anyone think this is a bad idea? I'm gonna try this out post implementing string interpolation, but will avoid that work if people think it's better to duplicate work rather than store more info in the AST
The reason why this is tempting to me is because I'm trying to report all problems with an ident in a single message for every string we see in Roc
And it seems like we calculate a lot of these attributes later during canonicalization, but we can actually just do it while chomping characters. That way, we don't have to worry about handling non-ASCII or whatever when calculating this in roc_can
because we know exactly what we parsed earlier.
This aligns better with "always inform, never block"
Currently, we give a runtime error if a function has two underscores. I think we should just parse it, give a warning, and run anyway
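To sketch the one-pass idea (this is just a minimal illustration; the bit name is a placeholder, not a real flag in the compiler):
```rust
// Sketch: accumulate problem bits while chomping bytes, so nothing
// downstream has to re-scan the ident. Bit name is a placeholder.
const PROBLEM_DOUBLE_UNDERSCORE: u8 = 1 << 0;

fn scan_ident(bytes: &[u8]) -> u8 {
    let mut problems = 0u8;
    let mut prev_was_underscore = false;
    for &byte in bytes {
        if byte == b'_' {
            if prev_was_underscore {
                problems |= PROBLEM_DOUBLE_UNDERSCORE;
            }
            prev_was_underscore = true;
        } else {
            prev_was_underscore = false;
        }
    }
    problems
}
```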
i don't have a preference for the specific mechanism but I think these should definitely be consolidated behind an API and the specific API implementation should be opaque to the caller. Ideally these are methods like `is_effectful` etc. on the ident or symbol itself. If that's not possible, they should be methods on whatever object manages the symbol.
That's what I was thinking.
After simultaneously parsing an ident and its attributes, we'd return a
```rust
struct Ident<'a> {
    text: &'a str,
    attributes: u8,
    problems: u8,
}
```
And you'd access those values via methods
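Roughly like this, as a minimal sketch (flag names and bit positions are placeholders, not final):
```rust
// Sketch only; flag names and bit positions are placeholders.
const EFFECTFUL: u8 = 1 << 0; // `!` suffix
const IGNORED: u8 = 1 << 1; // leading `_`

impl<'a> Ident<'a> {
    pub fn is_effectful(&self) -> bool {
        self.attributes & EFFECTFUL != 0
    }

    pub fn is_ignored(&self) -> bool {
        self.attributes & IGNORED != 0
    }
}
```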
I've already written these methods for a post-parsing IdentAttributes
But I think they belong at parsing and later instead
seems fine. as long as the fields are always private and immutable
Yep, sounds good
what attributes specifically did you have in mind?
(out of curiosity)
These are the current valid flags I'm parsing:
And these are the current ident errors (though some may go away without needing to verify the ident parsed):
`_!` at the end instead of `!_`?
My plan for these errors is to combine them into a single warning per ident with a list of errors, to maximize info without clustering multiple warnings on the same ident
For the flags, not all of them are needed the whole time. The uppercase thing isn't needed once we figure out if a thing is a tag/alias or a var/function/value/typevar
Underscore at the start gets the ident ignored, but we probably still want to track the ident in case someone tries to run code that looks like
```
_val = 123
do_something_else!()?
_val = 456
```
The current technique that Agus is employing is to use the first bit of the 32-bit ident ID as a bitflag for whether the var is `!`-suffixed
That gives us about 2 billion idents available instead of 4 billion per module
Using 4 bits brings us down to a quarter of a billion
We may not even need 4 bits, but if we use 4 bits that doesn't feel like much of a cost
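For illustration, the packing could look something like this (the exact bit layout here is hypothetical):
```rust
// Hypothetical layout: 4 attribute bits at the top of a 32-bit ident ID,
// leaving 28 bits (~268 million values) for the index itself.
const ATTR_BITS: u32 = 4;
const INDEX_BITS: u32 = 32 - ATTR_BITS;
const INDEX_MASK: u32 = (1 << INDEX_BITS) - 1;

fn pack_ident_id(index: u32, attrs: u8) -> u32 {
    debug_assert!(index <= INDEX_MASK);
    ((attrs as u32) << INDEX_BITS) | index
}

fn ident_attrs(id: u32) -> u8 {
    (id >> INDEX_BITS) as u8
}

fn ident_index(id: u32) -> u32 {
    id & INDEX_MASK
}
```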
This change also means that we probably can get away with not storing string values during parsing. We can just store a 64-bit rolling hash for each string and check for equality based on that value.
If you need the original string, you use the region of the string to get it back
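As a sketch of the hashing side (FNV-1a here is just a stand-in for whatever 64-bit incremental hash we'd actually pick; the constants are the standard FNV-1a ones):
```rust
// Minimal sketch: a 64-bit FNV-1a hash updated byte-by-byte as we chomp,
// so the hash is ready the moment the string ends, with no second pass.
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

#[derive(Clone, Copy)]
struct RollingHash(u64);

impl RollingHash {
    fn new() -> Self {
        RollingHash(FNV_OFFSET)
    }

    fn chomp(&mut self, byte: u8) {
        self.0 ^= byte as u64;
        self.0 = self.0.wrapping_mul(FNV_PRIME);
    }
}
```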
the string is just a pointer right? so there should be no size difference
If it's always a pointer, then yes
But string equality checking in can right now is O(n)
yeah but the strings are small and that only matters before you turn the string into a symbol. it should be benchmarked but i would bet a lot of money that is very low on the flamegraph
Yeah, fair
My thought is that we use a lot of `VecMap`s
And that's because we want to avoid the cost of hashing
for small collections yes
debuggability is probably more important than perf in the current implementation though
But if strings are always hashed already, then you can use them as keys in a proper hashmap
Okay, for now I'll keep strings as strings
you still have to deal with collisions right? or write your own string hasher
You'd make your object a `Vec<(string hash, Vec<item>)>`
Writing your own string hasher would just mean a rolling hash
you would have to do something to ensure there are no hash conflicts right?
at which point it's an interner b/c 1-to-1 mapping
Yeah, true
This is where we walk the entire string interner every time we look for a symbol that matches what we have in scope
It'd be nice for the interner to get closer to O(1) than O(n) by hashing the strings for insert
The way to avoid that is to have your string interner's keys be the hashes of the strings, and the values be `Vec<(&'a str, other, data)>`
I'll be improving this performance by maintaining a separate data structure in the can rewrite for tracking ident IDs in scope
But I'm very conscious of the time we put into interning strings
As you point out, this doesn't matter right now
But O(n) searches of all idents interned so far every time we want to deduplicate ident IDs by string search seems pretty sus
the way i would do it is to keep a hash map instead. then look up by hash and check for any collisions. if there are no collisions insert
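i.e. something along these lines (a sketch; all names invented):
```rust
use std::collections::HashMap;

// Sketch: buckets are keyed by the precomputed hash; a string compare
// only happens within the (almost always tiny) bucket for that hash.
struct Interner<'a> {
    buckets: HashMap<u64, Vec<(&'a str, u32)>>,
    next_id: u32,
}

impl<'a> Interner<'a> {
    fn intern(&mut self, hash: u64, text: &'a str) -> u32 {
        let bucket = self.buckets.entry(hash).or_default();
        if let Some(&(_, id)) = bucket.iter().find(|&&(s, _)| s == text) {
            return id; // same hash *and* same string: already interned
        }
        let id = self.next_id;
        self.next_id += 1;
        bucket.push((text, id));
        id
    }
}
```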
i wouldn't get too nerdsniped by this though. the really expensive parts are typechecking and monomorphization. everything else is kind of a drop in the bucket comparatively
yeah and +1 on "correctness first"
Ok. An aside: is just mono that expensive?
like architecturally we want to be set up for caching
but on the level of an individual data structure, we can optimize those later as desired
yes
the reason c++/rust style languages have slow compilers is because of specialization
specialization is the #1 cost in any compiler
everything else doesn't matter (comparatively). specialization forces a potentially-exponential growth in program size, and hence compute during compilation
Wow, good to know
That makes me wonder if long-term, Roc would have a faster dev backend startup time if we did dynamic dispatch for everything
That would be very long-term, though
I'm not sure about that...seems like Zig and JAI are way faster than Rust or C++ even though they both specialize, and they specialize via compile-time interpreting of source code, which is likely to be slower than parametric polymorphism type inference
"specialization is costly" is definitely true, but "specializing compilers are all slow" is empirically false thanks to those counterexamples, so there is obviously more to the story
i'm talking about relative speeds, not in absolute terms. zig is faster than c++ and rust but that doesn't mean that zig without specialization would be just as fast as zig with specialization. it would certainly be faster
oh yeah for sure!
c++ and rust have a lot of other stuff that doesn't help. auto derefs are a good example
yep haha
but what i mean is, in terms of what is the bottleneck for any given compiler, specialization is definitely at the top
totally
specialization is the only thing that cannot be O(N) where N is program size
well, after linking and code gen in the case of LLVM - although specialization does create more work for both of those steps
Can typechecking be O(N)-ish with our type system?
It feels very graph traversal-y at the moment
yes typechecking is o(n)
in practice
Ah
just don't write a giant pyramid of nested defs please :stuck_out_tongue:
I can see in practice
even just for generating the specializations, it's bigger than o(n)
because you emit equal or more code than is fed in
so just producing the specializations is expensive
I get that specialization gen is bigger than O(n), but just making concretely-typed copies seems cheap even if you do a lot of it
and it's compounded by the fact that you have to duplicate data, etc (and there is little getting around that)
But the part that would suck is everything after it working on more than the original N functions/values
yeah the reason is that specialization is more than making trivial copies
you are basically chopping up graphs of the program and rewriting them
I figured
Oh, even just the naive graph expansion is bottleneck number one for relative time spent by compiler passes??
there are things that you can do to make the constant cost better, e.g. avoid over-allocating, have a dense union-find representation of types, compile to a good representation (e.g. passing large objects by pointer instead of by value everywhere). but the non-constant cost can't change
maybe a better way to put it is anything that is rewriting the program to a non-constant equivalent is going to be an expensive part. optimizations are another good example of this
another kind of interesting thing about specialization is that it is inherently at odds with separate compilation
this is why you don't really have separate compilation in Rust, and in C++ you have to include generic functions in header files (so you have separate compilation until you need a generic function)
which adds to the feeling of compilation time overall
there are ways to get around this though. for example boxing values at compilation unit boundaries so there is a uniform representation. This is what Swift does to support separate compilation/dynamic linking with generics IIRC
yeah that route is tricky for us because of host ABIs unfortunately
like if we give a `List Str` to the host, it needs to look the same way in memory with or without optimizations, and we definitely want specialization in optimized builds even if we wanted dev builds to do something else
i honestly think the long term answer for dev builds if you care about compiler perf (for any specializing compiler) is an interpreter
in roc's case the interpreter would have some c ffi and the rest is just interpreted, no specialization
you also don't need to worry about it for the host ABI because the host ABI has to be concretized - there is no polymorphism over it
it only matters within same-language (Roc<>Roc in this case) calls
so like convert at runtime right at the boundary?
> long term answer for dev builds if you care about compiler perf (for any specializing compiler) is an interpreter
Yeah, this is probably correct. Probably a bytecode interpreter like python. (Though this does leave weird holes where dev and optimized differ even more).
Like host calls into roc (which just launches an interpreter over the roc bytecode). When calling back into the host, just use libffi to deal with all the types and calling the host pointers.
This should be very doable with lots of gains.
Likely we'd still want tail call optimization at a minimum, but most things could remain boxed and such and still run fast enough.
The problem with dev backend compilation time, long term, isn't the dev backend. It is all the parts above the dev backend that deal with type checking and specialization.
Those are the slow parts that benefit significantly from a boxed representation and such.
Also, I'm not sure which would be faster, dev assembly or a bytecode interpreter. Probably still the dev assembly, but less clear especially if you do some smart bytecode assembler tricks (not to mention some lambda chaining tricks that get pretty optimized code without writing any assembly).
Obviously, we could box everything and do assembly gen fast just change the assembly to understand boxed types.
> so like convert at runtime right at the boundary?
yes
yeah to Brendan's point, interpreter vs fast machine code generator is orthogonal to specialization
the real benefit would be if we could have the dev backend do something other than specialization at the cost of (much) more boxing at runtime
and yet at the boundary convert things over to the representation the host needs
which in some cases might be very expensive, unfortunately :sweat_smile:
well the great thing is you don't need the boundary conversion at the host, that's what im saying
What about something like `List Str`? In the boxed non-specialized world, wouldn't it be `List GenericRocType`?
hm, but unless I'm missing something, `List Str` would be different in memory depending on whether it's been monomorphized or not
i don't think it's that expensive. it's only a pointer dereference and indirect jumps for abilities
right, there would be some runtime metadata there I'd think
at the host there's no polymorphism
at the host boundary*
but you can give a `List Str` to a host
there are no polymorphic variables
they're all concrete
sure, but you don't need to box it because you know that the function will always have the same concrete inputs and outputs
but `List Str` and `List U8` have different memory layouts if they've been specialized
right, but no host-exposed function has more than one specialization
Other languages are more free to do this because the language runtime and the "platform" are not separated and are cohesive
but let's say I have a call to `List.map` which returns a `List Str` which I hand off to the platform
how can `List.map` return a `List Str` sometimes and a `List U8` other times I call it, without either specialization or some runtime metadata on `List`?
okay, i see what you're getting at now
I'm not sure that is supported though
> but you don't need to box it because you know that the function will always have the same concrete inputs and outputs
Wait, your suggestion is more meta than I realized. You don't want to box everything, you just want to box anything that has more than one specialization. That said, you want to box only over the possible specializations. So for a function that in practice has two specializations `List Str -> ...` or `List U8 -> ...`, you would essentially convert it into a `[ListStr (List Str), ListU8 (List U8)] -> ...`
So very different from the standard interpreter that boxes everything and does so in a recursive manner.
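In Rust terms, the shape would be roughly this (a sketch; the layouts are invented stand-ins):
```rust
// Rough analogy of `[ListStr (List Str), ListU8 (List U8)] -> ...`:
// the callee takes a tag union over exactly the specializations that
// exist, rather than a universally boxed value.
enum ListStrOrListU8 {
    ListStr(Vec<String>), // stands in for Roc's `List Str` layout
    ListU8(Vec<u8>),      // stands in for Roc's `List U8` layout
}

fn callee(arg: ListStrOrListU8) -> usize {
    // one dispatch at the call boundary, then each arm is fully concrete
    match arg {
        ListStrOrListU8::ListStr(strs) => strs.len(),
        ListStrOrListU8::ListU8(bytes) => bytes.len(),
    }
}
```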
i don't think you can pass a function to the host that is polymorphic for exactly that reason - the host has to have a concrete (non-generic) interface so i'm not sure how you would pass such a function through?
like I can't say on the host side "i expose an effect `f: Str -> (a -> a) -> Str`"
you have to say `f: Str -> (Str -> Str) -> Str`
at which point you know exactly what specialization you need and you don't need anything more
> Wait, your suggestion is more meta than I realized. You don't want to box everything, you just want to box anything that has more than one specialization. That said, you want to box only over the possible specializations. So for a function that in practice has two specializations `List Str -> ...` or `List U8 -> ...`, you would essentially convert it into a `[ListStr (List Str), ListU8 (List U8)] -> ...`
Yes, that could work too. But i'm not quite following how the host can take any function that is generic. I don't think it's possible but maybe i'm missing something
I'm not worried about directly interacting with the host. I am worried about indirectly interacting with the host.
what is the difference?
I thought part of the goal of this work was to avoid specialization. As such, even though the host has concrete types, it has to be able to call functions that have multiple specializations and thus take generic boxed types. So there has to be a way to convert concrete types to/from generic types.
Aside, I think an interpreter could theoretically avoid boxing and lazily specialize when a function with a specific specialization is called.
My argument is that the host can never interact with any Roc function that is polymorphic, hence you don't need to worry about generics over the host boundary. Because how could that work if it did? You would need to re-compile the host every time the app uses another specialization
yes, i think you're right
sorry for devolving your thread Sam lmao
This was educational
I got what I wanted, which was approval for me doing code cleanup
i'm still confused on if i'm missing something about the app<>host boundary. am i? or is it correct that the host actually has no way to call a polymorphic function
Host could call something with a Box model
But that is kinda different
but that has to be equivalent to a single function, it can't dispatch to one of N functions right
yeah
Yeah
but the host can't actually call `List.map` polymorphically
there's no way to do that and be able to compile the host separately i think
without boxed polymorphic types obviously
Yeah
okay, so then this works just fine at host boundaries. you don't need any boxing at all.
I just don't understand how, without specializing `List.map`, we go from the concrete list passed in from the host to calling a polymorphic function.
Like if main is simply:
```
main : List Str -> List U64
main =
    if #some arbitrary conditional then
        List.map Str.toU64
    else
        List.map Str.toI64
    |> List.map Num.abs
```
sorry i think i wasn't clear
`List.map` would be specialized
I was talking about 3p functions at compilation boundaries (let's say libraries for now)
but the case is similar. in this example you would simply unwrap the boxed type at the end and return that.
yeah the problem with that is what if you have like `List (List Str)` or something
why is that a problem?
now you gotta traverse the whole thing and unbox all the inner lists
i don't think that's necessarily true. you only need to do the wrap/unwrap if you definitely know you are calling a polymorphic-but-unspecialized function. which in practice is going to be a small amount of the functions in your program (since it's only at 3p library API boundaries)
Also, it doesn't need to be a nested unwrap. If you pass `List (List Str)` to a function like that, you only need to wrap the whole structure, not the nested ones
this is succinct explanation of what i was saying: https://lobste.rs/s/ei5bp4/apple_is_killing_swift#c_zwjc1i. very coincidental