Stream: compiler development

Topic: simplifying internal abi


view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 19:57):

As part of planned changes for our llvm c abi, I think I want to first simplify some of our llvm internal abi.

Currently, we sometimes pass values in registers and sometimes pass them by pointer. When we pass them by pointer, we never pass them byval or byref. Our heuristic for passing in registers is simply based on the size in bytes. This actually leads to some pretty bad code gen. It also is extra work for us that llvm will already do during optimization.

To break it down...

Sometimes passing as aggregate and sometimes passing by pointer

This makes it so that we have tons of split code paths. One path to handle the aggregate and one to handle the pointers. We can not simply load the pointer and then handle it like an aggregate. It is considered a bad idea to load aggregates into registers (especially large aggregates). As such, for the pointer path we use pointer offsets and small loads while for the register version, we just directly extract values.

This is simply extra code paths and complexity. It also is for ~0 gain. While this does give us finer grain control, llvm passes already have solutions for this that are reasonable.

Our heuristic for passing in registers is simply based on the size in bytes

This is simply a really naive heuristic and probably ends up hurting performance very often. When llvm sees an aggregate struct, it will put every single field in its own register. This means that our naive heuristic can lead to really bad results.

Consider this roc type:

RGB : { r : U8, g : U8, b : U8 }
PairRGB : (RGB, RGB)

Both RGB and PairRGB are under our threshold. As such, we will pass them as aggregate values. This means that on x86_64, a single PairRGB value will immediately consume all registers and everything else will have to be passed on the stack.

This means that lots of small data that just happens to be many fields can quickly consume tons of registers and really hurt performance.
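To make the mismatch concrete, here is a minimal Rust sketch that compares the byte size of a layout with the number of scalars it flattens into. Everything in it is hypothetical and for illustration only: the `Layout` type, the helper names, and especially `ASSUMED_SIZE_THRESHOLD`, which stands in for whatever the real cutoff is. The register count of 6 is the number of integer argument registers in the x86_64 System V calling convention.

```rust
/// Hypothetical layout model: a single scalar or an aggregate of nested layouts.
enum Layout {
    Scalar { bytes: usize },
    Struct(Vec<Layout>),
}

impl Layout {
    /// Total size in bytes, ignoring padding (fine here, every field is a U8).
    fn size_bytes(&self) -> usize {
        match self {
            Layout::Scalar { bytes } => *bytes,
            Layout::Struct(fields) => fields.iter().map(Layout::size_bytes).sum(),
        }
    }

    /// Number of scalars after flattening; each one wants its own register.
    fn scalar_count(&self) -> usize {
        match self {
            Layout::Scalar { .. } => 1,
            Layout::Struct(fields) => fields.iter().map(Layout::scalar_count).sum(),
        }
    }
}

/// Assumed value, not the real threshold used by the compiler.
const ASSUMED_SIZE_THRESHOLD: usize = 16;
/// Integer argument registers in the x86_64 System V calling convention.
const X86_64_SYSV_INT_ARG_REGS: usize = 6;

/// RGB : { r : U8, g : U8, b : U8 }
fn rgb() -> Layout {
    Layout::Struct(vec![
        Layout::Scalar { bytes: 1 },
        Layout::Scalar { bytes: 1 },
        Layout::Scalar { bytes: 1 },
    ])
}

/// PairRGB : (RGB, RGB)
fn pair_rgb() -> Layout {
    Layout::Struct(vec![rgb(), rgb()])
}
```

Under these assumptions `pair_rgb()` is only 6 bytes, well below the threshold, yet it flattens into 6 scalars, which matches the full set of integer argument registers.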

When we pass them by pointer, we never pass them byval or byref

Simply put, byval pointers are the pointers that are meant to be implicitly passed on the stack. In general, llvm understands how to optimize byval pointers. If they are reasonable to pass in registers, llvm will do that. If a function uses it in a read only way, llvm will avoid copying it over and over again (which should always be the case in roc).

byref pointers are similar to byval but do not make a copy to put them into the next stack frame.

Given everything is immutable in roc, I think we can pass everything as ptr readonly byref(%type) align(%align). llvm will then automatically promote to registers if it sees fit to do so. On top of that, being byref instead of byval avoids the duplication on the stack required for potential mutation (can still mutate something nested in a pointer like a list).


Bonus: defining everything as pointers means that we can specify alignment correctly everywhere instead of depending on nesting types like [i128 * 0] to get 16 byte alignment. You can directly specify alignment of pointers and allocas.


One annoyance. The argpromotion pass is currently set to only promote to pass in registers if the struct is 2 values or less. I want to set that to at least 3 (so string and list can pass as multiple scalars if reasonable for a function call). This is an argument when generating the c++ pass, but I'm not sure if we can set it from inkwell. Especially given we now create our pass pipeline via a giant string of pass names. Need to look into this more. Even if I can't set this now, I would rather simplify everything and figure out setting it later. As a note, it is smart enough to recognize unused fields. So a list where only the pointer and length are accessed is considered only 2 elements.
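The rule described here can be modeled with a small Rust sketch. This is a toy model of the behavior as stated in this post, not LLVM's actual implementation; `would_promote` and its parameters are hypothetical names.

```rust
/// Toy model of the promotion rule described above (not LLVM's real logic):
/// a pointer argument is split into scalars only if the number of fields the
/// callee actually reads is at most `max_elements`.
fn would_promote(field_is_read: &[bool], max_elements: usize) -> bool {
    let used_fields = field_is_read.iter().filter(|&&is_read| is_read).count();
    used_fields <= max_elements
}
```

With a limit of 2, a list whose callee reads only the pointer and length (`[true, true, false]`) would be promoted, while a callee that reads all three fields would not. Raising the limit to 3 would cover that second case.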

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 20:28):

Hmmm....just realized a problem with my idea. Argpromotion does not deal with return by pointer. So all aggregates would always be returned by pointer instead of being returned in registers.....

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 20:29):

So unless we do our own packing, this will still be suboptimal

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 20:31):

Maybe we just need to consider all function args and returns no matter what as special. Shape them into exactly what we want. Then inside a function always convert to pointers to allocas for structured types so that we can access everything in a uniform way. llvm definitely is good at removing write once and then read only allocas like the ones that could be used for arguments..... Then at least all the struct/pointer conversion code can just be for function arguments and returns and everything else can uniformly be pointers.
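As a loose analogy for that shape, here is a Rust sketch in which the function boundary takes individual scalars, writes them once into a local slot (standing in for an alloca), and the body only ever sees a pointer to that slot. `ListRepr` and both function names are hypothetical.

```rust
/// Hypothetical three-word list representation.
#[allow(dead_code)]
struct ListRepr {
    elements: *const u8,
    len: usize,
    cap: usize,
}

/// Boundary: arguments arrive as separate scalars, shaped for the call conv.
fn list_is_empty(elements: *const u8, len: usize, cap: usize) -> bool {
    // Write-once local slot, analogous to an alloca that is stored to once
    // and only read afterwards.
    let slot = ListRepr { elements, len, cap };
    list_is_empty_body(&slot)
}

/// Body: structured values are always accessed through a pointer.
fn list_is_empty_body(list: &ListRepr) -> bool {
    list.len == 0
}
```

Only the boundary function knows how the value was passed; the body has a single uniform access path.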

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 20:32):

cc @Folkert de Vries and @Ayaz Hafiz for opinions and ideas

Since I want to fully revamp and fix up c abi, might as well think about cleaning up the full system.

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 20:37):

Also, I was looking at rust and their heuristic seems to be overly simple. 2-item structs: pass as separate args. 3+ item structs: pass as pointer

view this post on Zulip Richard Feldman (Dec 15 2024 at 21:02):

Brendan Hansknecht said:

Given everything is immutable in roc, I think we can pass everything as ptr readonly byref(%type) align(%align).

hm, but they can be mutated if the refcount is 1, yeah?

view this post on Zulip Brendan Hansknecht (Dec 15 2024 at 21:08):

readonly is only top level, not recursive. We pass the list as a readonly pointer. We then read the underlying allocation pointer from the list. That pointer is not read only. We can write to that.
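A rough Rust analogy, with hypothetical names: a shared reference to the list struct plays the role of the readonly pointer, and the raw pointer stored inside it is still writable.

```rust
/// Hypothetical three-word list representation.
#[allow(dead_code)]
#[repr(C)]
struct ListRepr {
    elements: *mut u8,
    len: usize,
    cap: usize,
}

/// `list` is read-only at the top level: the struct's fields cannot be
/// reassigned through `&ListRepr`. The allocation behind `elements` is a
/// separate pointer and can still be written.
fn set_first(list: &ListRepr, value: u8) {
    if list.len > 0 {
        // SAFETY: assumes `elements` points to at least `len` valid bytes
        // that nothing else is accessing during this call.
        unsafe { *list.elements = value };
    }
}
```

Calling `set_first(&list, 9)` on a list backed by a three-byte buffer changes the first byte of the buffer while leaving the `ListRepr` struct itself untouched.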

view this post on Zulip Richard Feldman (Dec 15 2024 at 21:33):

ahh right!

view this post on Zulip Ayaz Hafiz (Dec 16 2024 at 01:22):

+1 to reducing code paths, whatever form it takes. This seems reasonable. We have to pass large structs by pointer though, there was a huge regression that was the impetus for that

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 02:17):

Yeah, and the plan here would be to pass all structs by pointer and then leave llvm to change them to by value as an optimization.

view this post on Zulip Ayaz Hafiz (Dec 16 2024 at 02:45):

makes sense to me.

view this post on Zulip jan kili (Dec 16 2024 at 15:21):

By delegating value-ification to llvm, would we avoid the classes of mis-mappings that result in platforms reading non-pointer registers incorrectly? Does that address alignment+packing and solve register selection? Would platform roc_fx_ functions still need to choose whether to expect a value or pointer?

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 16:50):

Not at all sadly

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 16:50):

Llvm does not know how to speak c abi

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 16:51):

So for c abi, we still have to do something special

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 16:52):

As mentioned above, we could still choose to speak a limited subset of c abi to make our lives a lot easier.

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 16:58):

Actually, if we force all structs and unions (minus enums) to be by pointer and all other types to be by value, that would make for very simple valid c abi......I might do that, at least as a starting point. It would be very similar to our _generic function today.

Compared to proper c abi, this minorly hurts perf for any record or union under 16 bytes. All other primitives would be unchanged.
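The restricted rule is small enough to write down directly. A minimal sketch in Rust, where `TypeKind`, `PassMode`, and `pass_mode` are hypothetical names rather than anything from the compiler:

```rust
#[derive(Debug, PartialEq)]
enum PassMode {
    ByValue,
    ByPointer,
}

/// Hypothetical coarse classification of a type at the host boundary.
#[allow(dead_code)]
enum TypeKind {
    Int,
    Float,
    /// A tag union with no payloads.
    Enum,
    Struct,
    TagUnion,
}

/// Structs and unions (minus enums) go by pointer; everything else by value.
fn pass_mode(kind: &TypeKind) -> PassMode {
    match kind {
        TypeKind::Struct | TypeKind::TagUnion => PassMode::ByPointer,
        TypeKind::Int | TypeKind::Float | TypeKind::Enum => PassMode::ByValue,
    }
}
```

The decision never looks at size or field count, which is what keeps it simple, and also why small records and unions lose the in-register path.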

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 17:03):

That said, if we use this restricted ABI, and let llvm do promotion to values (which is only a little sad due to cutting off at 2 registers instead of 3, without us being able to tune the arg to my knowledge), I think we could actually have the same internal llvm call conv as our external c call conv and that sounds worth it.

view this post on Zulip jan kili (Dec 16 2024 at 18:06):

Sweet!

view this post on Zulip jan kili (Dec 16 2024 at 18:06):

Brendan Hansknecht said:

Compared to proper c abi, this minorly hurts perf for any record or union under 16 bytes.

I imagine that if a platform cared strongly about perf, it could likely pass a single U* value instead, for low cost in app side bit math.

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 18:47):

They could try. May not work in a lot of cases due to roc having no sort of unsafe cast.

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 18:48):

But also, if we get to the point where a well designed platform is hitting perf issues due to this, we can work on a fix then (and if we ever allow platform + app lto, it would likely fix all these issues anyway)

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 18:48):

Overall not much of a concern especially given roc never mutates anything

view this post on Zulip Richard Feldman (Dec 16 2024 at 18:53):

yeah this feels like another case (lambda sets being another) where the volume of bugs has been so high for so long that it definitely feels like intentionally sacrificing some performance for the sake of correctness is the right move, especially when the performance delta we're talking about is this small

view this post on Zulip Brendan Hansknecht (Dec 16 2024 at 19:02):

Yeah, the main cost here is a small amount of extra indirection (still highly likely in L1 cache) when calling to/from the host.

Then internally, if we can't tune the llvm arg promotion pass, the cost is passing/returning lists and strings by pointer instead of in registers. I would guess this is actually more likely to be a perf hit than the c abi stuff. That said, I still would bet it is an exceptionally small cost.

To be even more clear here, in other languages, they have to pass by reference most of the time due to allowing mutation. So this would be no worse than c or rust from the call conv side. It will waste more stack space on stale versions of lists/strings that no longer matter. (Due to us always making a new list/str struct instead of mutating in place).


Last updated: Jul 06 2025 at 12:14 UTC