very interesting! https://gist.github.com/FeepingCreature/5dff669aad380a123b15659e195fb96c
Now, LLVM is a very good optimizer, but this does not leave it much room. The value has to go on the stack, which means there must be space for it there, it must be copied out of the register it is probably living in, and it has to remember which parts of the stack are in use and which ones can be reused by another call, which it turns out to be pretty poor at.
I think this actually stems from a misunderstanding: LLVM doesn't really deal with calling conventions and such. The proper solution is generally to pass a single pointer to the entire struct all the way down the call stack, so there shouldn't be any of this copying to begin with.
So fundamentally they are passing by value when they should be passing by reference.
As such they keep copying over and over again.
Also, LLVM's fastcc calling convention should enable skipping the stack for things like this instead of following the AMD64 SysV argument-passing rules.
Also, this is a dangerous microbenchmark
Passing 3x the args will eat up all of the registers very quickly on an x86-64 system, so they will hit bad perf from that very fast. Also, passing everything in registers likely means way more shuffling of data around and more pushing and popping (as arg lists get longer and functions get more complex).
Lastly, they explicitly block inlining, which would otherwise make small functions like those equivalent. And for larger functions, the cost diminishes quickly.
So I think this falls into the category of a bad microbenchmark for the most part
Interesting!
At the same time, I should clarify: as long as the function doesn't/can't be inlined (or changed to LLVM fastcc), that specific benchmark will be faster with the split version, though with smaller gains the longer the function gets. Also, depending on the calling context, it could require popping a bunch of stuff to make the function call, but that is unlikely.
SysV has exactly 6 integer argument registers, so they are using those 6 registers perfectly and never put anything on the stack.
As a fun note, enabling lto on that example (and thus enabling inlining) is more than 2x faster than having a function call at all.
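For anyone wanting to reproduce that, a hedged sketch of what "enabling LTO" looks like when the benchmark is split across translation units (file names are hypothetical; `-flto` is the relevant flag):

```shell
# Without LTO: the call crosses an object-file boundary, so it
# can't be inlined at link time.
clang -O2 -c bench.c -o bench.o
clang -O2 -c funcs.c -o funcs.o
clang -O2 bench.o funcs.o -o bench-nolto

# With LTO: the linker sees LLVM bitcode and can inline across
# translation units, eliminating the call entirely.
clang -O2 -flto -c bench.c -o bench.o
clang -O2 -flto -c funcs.c -o funcs.o
clang -O2 -flto bench.o funcs.o -o bench-lto
```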
Tangent: the related benchmark linked in that post might be interesting to implement in Roc (assuming our JSON parsing is far enough along).
Last updated: Jul 06 2025 at 12:14 UTC