There have been some performance benchmarks written showing that Roc is "in there" when running a simple QuickSort compared to other languages; however, that may not be low level enough to really show comparative performance. I posted an issue showing that Roc is currently over twenty times slower than languages such as C/C++ at truly low-level work. The issue pretty well says it all, but I thought I would sign on here to see if any of you had ideas about what I might not be doing correctly or what might be done about it...
Just a note, there is List.repeat
Also, thanks for posting, will be interesting to figure out what is going on.
@Brendan Hansknecht, thanks for the suggestion on List.repeat; actually, that was what I tried first, but I got a "not yet implemented" message and came up with the kludge. Just tried it and it works fine now. At any rate, it won't affect the benchmark, as creating the buffer is only a tiny part of the overall execution time...
Yeah, I guess it isn't implemented on the dev backend yet, so it doesn't work in the repl.
@GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.
It should also work on the repl on the website.
Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.
We probably need bounds check hoisting.
Every iteration of the innermost loop, we are checking the bound. That is not necessary cause the loop already has a check: if c >= 16384 then cp else
If roc could statically analyze that, it could remove the bounds check and get the perf I posted above (which I got by manually removing the check).
Also, I'd be curious to see how c++ does if you use a std::vector and only access it through checked methods. That would be the equivalent of what roc should be generating.
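For concreteness, a minimal C++ sketch of that comparison (the function and parameter names are illustrative, not from the benchmark source):
#include <cstddef>
#include <cstdint>
#include <vector>

// The same inner culling loop, but every access goes through
// std::vector::at, which range-checks and throws on out-of-bounds --
// roughly the equivalent of Roc's checked list access.
void cull_checked(std::vector<std::uint8_t>& cmpsts, std::size_t start,
                  std::size_t step, std::uint8_t mask) {
    for (std::size_t c = start; c < 16384; c += step)
        cmpsts.at(c) |= mask; // .at() performs the per-access bounds check
}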
Brian Carroll said:
GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.
Yes, we are past that as List.repeat works when optimization is turned on...
@Brendan Hansknecht,
We probably need bounds check hoisting.
Every iteration of the innermost loop, we are checking the bound.
Thirty-five CPU clock cycles per array bounds check is pretty expensive bounds checking. I suspect that the bounds check is also triggering something else, such as not inlining a function or not lifting something, which is why I asked if it is possible to view the assembly output.
Yes, automatic bounds checking elision is something that sophisticated compilers do, but when optimized properly, array bounds checks should only take a CPU clock cycle or less.
Also, I'd be curious to see how c++ does if you use a std::vector and only access it through checked methods. That would be the equivalent of what roc should be generating.
I don't know about C++, but the bounds check in Rust costs about a CPU cycle per check; I'd assume that it would be about the same with C++...
May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.
@Brendan Hansknecht,
Every iteration of the innermost loop, we are checking the bound. That is not necessary cause the loop already has a check: if c >= 16384 then cp else
Most compilers, such as F#/C#, that do bounds check elision require that the loop upper bound be explicitly the length of the array, as in "if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time.
@Luke Boswell,
May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.
Did that, but there's too much code to easily follow, with labels that obscure which part of the source code generated them. When looking at assembly code, I can usually identify the innermost loop by the short loop with an "or" bitwise operation applied to 8-bit operands. I could run the .ll code through LLVM's opt and llc programs to generate the assembly, but I don't know what arguments to apply.
@Brendan Hansknecht,
Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.
What was the clock speed for the above test? And what was the issue? Assuming that the above output is after the issue is fixed, and likely still includes bounds checking?
Bounds checking, especially for Nat indices, is very cheap if the compiler is using registers efficiently, which LLVM-generated code usually does: comparing the index in a register to the array size in another register takes a quarter or a third of a CPU clock cycle, and the branch on the out-of-range condition takes about zero time when it matches the branch prediction, which will normally be to not take the overflow branch.
Even if Int indices are used, it takes just one more branch on register negative, which will still match the prediction of not taking the branch in the normal case.
It may barely be worth all the work of detecting and eliding the bounds check when it isn't needed...
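As a hedged illustration of that cost model (checked_or is a made-up name, not from any of the benchmarks):
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// A single bounds check lowers to one register compare plus a conditional
// branch that the predictor learns is never taken, so the steady-state
// cost per access is close to zero.
std::uint8_t checked_or(std::uint8_t* buf, std::size_t len, std::size_t i,
                        std::uint8_t mask) {
    if (i >= len)      // cmp index, length; jae -> the entire check
        std::abort();  // overflow branch, normally never taken
    return buf[i] |= mask;
}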
@GordonBGood do you have a small piece of code that reproduces the List.repeat TODO? It works just fine in the repl for a number of types, so I'm confused what the issue could be.
the bounds check in Rust costs something about a CPU cycle each check
I don't think there will be a bounds check in the F# or Rust. They are using a statically sized array. As such, it should be trivial to know the size and avoid the bounds check completely.
"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time
Yeah, I noticed that too, which is definitely odd.
What was the clock speed for the above test? And what was the issue? Assuming that the above output is after the issue is fixed, and likely still includes bounds checking?
I removed the bounds check from List.update, which is called on repeat in the innermost loop. That is the only change.
On my machine, timing went from about 1500ms to about 100ms. The posted C test takes about 70ms for reference.
That is on an intel linux machine with an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz. Of course that is base freq. Peak should be around 4GHz, though sustained is closer to 3GHz IIRC.
Ok I have been digging into the IR and here are the two main problems from what I can see:
1. List.get is a regular function, not a low level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and in the function we decrement the refcount (checking if we need to deallocate each time).
2. List.get has its allocas in the wrong place. LLVM is smart enough to realize that inlining that function with allocas in the wrong place would lead to a stack space explosion. As such, it refuses to inline the function unless the allocas are fixed.
So I did some testing to verify. One, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the llvm ir to fix the second issue.
The result: Found 23000 primes to 262146 in 103 milliseconds.
Note: both changes are needed for the perf.
No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
Oh, one extra note on the alloca. Each branch is generating its own output alloca, and then we are using a phi node to pick the one we want. LLVM doesn't seem to know how to optimize that. It is important that each branch is writing to the same alloca.
In fact, that is actually the biggest issue; my line labelled "Just alloca fix" above was still using a phi node.
This is just the alloca fix without the phi node: 200ms
So that still has all of the extra refcounting
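For intuition, a hedged C-level analogy of the phi-of-allocas shape (the real problem lives in the generated LLVM IR; on something this small LLVM's SROA pass would clean it up, but it gives up on the larger generated code):
#include <cstdint>

// Two stack slots selected through a pointer "phi": LLVM must reason
// about which slot the final load reads from, and in larger IR it
// gives up on the follow-on optimizations.
std::uint64_t phi_of_slots(bool cond, std::uint64_t x) {
    std::uint64_t a, b;
    std::uint64_t* out;
    if (cond) { a = x + 1; out = &a; } // each branch writes its own slot
    else      { b = x * 2; out = &b; }
    return *out;                       // "phi" picks which slot to read
}

// One shared slot that both branches store into is the shape LLVM
// expects and optimizes cleanly.
std::uint64_t shared_slot(bool cond, std::uint64_t x) {
    std::uint64_t out;
    if (cond) out = x + 1;
    else      out = x * 2;
    return out;
}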
Sounds like we have some pretty big llvm ir changes to make.
Made an overall tracking bug: #6434
@Folkert de Vries,
do you have a small piece of code the reproduces the List.repeat TODO
I'm sorry, I can no longer reproduce the issue. Others said that this function hadn't yet been added to some dev releases, but I just tested it and it now works in my sample code, so I don't know what the original problem was.
@Brendan Hansknecht,
I don't think there will be a bounds check in the F# or Rust. They are using a statically sized array. As such, it should be trivial to know the size and avoid the bounds check completely.
F# to DotNet will put in the bounds check unless the upper bound in the check is specifically given as cmpsts.Length, as that is what DotNet requires to do the bounds check eliding. Fable to JavaScript has no bounds check because JavaScript doesn't handle this without a lot of overhead. Rust, being the "safe" language that it is, will put in the bounds check, although I'm not sure about bounds check eliding for specific cases; if one wants to guarantee no bounds check for Rust vectors, one needs to use the unsafe get pointer manipulations as far as I know, although things may have changed in newer versions.
"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time
Yeah, I noticed that too, which is definitely odd.
It seems to just mean that you don't have bounds check eliding working yet.
On my machine, timing when from about 1500ms to about 100ms. The posted c test takes about 70ms for reference.
That is very similar to what I see on my machine. When the tests get short, as in 70 or 100 milliseconds, testing on laptop CPUs (as mine basically is) gets a bit unpredictable as to what the boost CPU clock rates will be: depending on the power profile, it may take more time than the length of the test to fully ramp the clock rate up to the full boost rate. I suspect that the C reference is actually running at about 1.5 clock cycles per loop rather than the calculated roughly 1.75 based on maximum boost, which means that, at least on my machine, this test isn't long enough to fully ramp up the clock rate. This is verified by running the test for 10,000 and 100,000 loops and observing that the time is less than proportional to the number of extra loops.
No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
That is very interesting, that both "fixes" need to be applied to get expected behavior.
Sounds like we have some pretty big llvm ir changes to make
Yes, it looks pretty complex!!!
Thanks for your replies and your efforts.
Yeah, one of the annoying things with llvm is that it is made for c/c++ like code. With that, it expects the IR to be in specific forms or it will give up on many optimizations. Since our IR is not what llvm expects, we are losing optimizations and getting overall poor perf.
That said, it has gotten more general over the years as more languages target it
I edited the Roc code in the OP of the issue to clean up the code a bit and to use List.repeat instead of the kludge. I also nested the closures since, if nesting closures turns out to be the problem, it needs to be fixed...
:+1:
@Brendan Hansknecht,
Yeah, one of the annoying things with llvm is that it is made for c/c++ like code. With that, it expects the IR to be in specific forms or it will give up on many optimizations. Since our IR is not what llvm expects, we are losing optimizations and getting overall poor perf.
Well, GHC Haskell also has an optional LLVM back-end and that works well enough; in fact, most times it produces better code than GHC's Native Code Generator, especially with regard to register allocation.
I don't know much about the Roc code base, but I wonder if you are getting too deep into the complexity of handling all the variations through Rust imperative-language details instead of stepping back and looking at how it would usually be done functionally; Unique/Linear Types have been around for quite a long time and are implemented in many languages. I wonder if using graph data-flow analysis to generate the AST with uniqueness flags already attached might save some grief in the back-end code generation...
All your references to alloca manipulations would seem to be a concern as to reliable and easy-to-maintain code, but that is just my two cents...
Well, GHC Haskell also has an optional LLVM back-end and that works well enough
For sure, but that is because their backend has been written in a way to generate the type of ir that llvm expects. In general, the closer to a c/c++-style ir you generate, the more perf you will get out of llvm. Our backend will do the same over time, but currently we have cases where we do things that don't play nicely with the llvm optimizations. Like this.
All your references to alloca manipulations would seem to be a concern as to reliable and easy-to-maintain code, but that is just my two cents...
I am sure that GHC and all other llvm backends do the type of alloca handling that I mentioned above. It is essentially a requirement to get llvm to optimize your code well. alloca is just part of life when generating llvm ir.
That said, this ties into what I meant when I said that llvm is made for c/c++ like code. A functional-language-specific llvm pass that makes different assumptions about pointer and aliasing rules could totally optimize these alloca instructions correctly.
llvm sadly assumes that all allocas will always be at the beginning of a function (that is what c does after all). It is smart enough to generally do hoisting correctly when inlining, but it can generate very broken code when combining some of these assumptions with tail calls. llvm also assumes that essentially all pointers can alias. This ruins tons of optimizations that would be obvious in other contexts.
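As a hedged illustration of the aliasing point (illustrative names; __restrict is a widely supported compiler extension, not standard C++):
#include <cstddef>
#include <cstdint>

// With no noalias information, the compiler must assume the store
// through out could modify *len, so *len is reloaded on every iteration
// instead of being hoisted out of the loop.
void zero_fill(std::uint8_t* out, const std::size_t* len) {
    for (std::size_t i = 0; i < *len; ++i)
        out[i] = 0;
}

// Declaring the pointers non-aliasing lets the load of *len be hoisted
// and the loop optimized freely.
void zero_fill_noalias(std::uint8_t* __restrict out,
                       const std::size_t* __restrict len) {
    for (std::size_t i = 0; i < *len; ++i)
        out[i] = 0;
}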
Anyway, all that to say our current backend is not generating what llvm expects and we need to change that
@Brendan Hansknecht,
Thank you for your explanation - it looks like you have a long row to hoe... ;-)
@Brendan Hansknecht,
I did "--emit-llvm-ir" on the optimized example and the innermost loop looks like the following:
else_block.i.i.i.i.i.i.i.i.i.i: ; preds = %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
%load_element.i.i.i.i.i.i.i.i.i.i = load i8, ptr %get_opaque_data_ptr.i.i.i.i.i.i.i.i.i.i, align 8
%int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i = or i8 %load_element.i.i.i.i.i.i.i.i.i.i, %int_shift_left.i.i.i.i.i.i.i.i.i
%23 = getelementptr inbounds i8, ptr %4, i64 %joinpointarg.i.i.i.i.i.i.i.i.i93
store i8 %int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i, ptr %23, align 1
br label %List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i
List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i: ; preds = %else_block.i.i.i.i.i.i.i.i.i.i, %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
call void @llvm.lifetime.end.p0(i64 2, ptr nonnull %result_value.i.i.i.i.i.i.i.i.i.i)
%gte_uint.i.i.i.i.i.i.i.i.i.i = icmp ugt i64 %operation_result.i.i.i.i.i.i.i.i.i.i, 16383
br i1 %gte_uint.i.i.i.i.i.i.i.i.i.i, label %joinpointcont.i.i.i.i.i.i.i.i.loopexit, label %else_block.i.i.i.i.i.i.i.i.i
It actually looks like it would be possible for this to be optimized into some reasonable assembly code, assuming that the load/modify/write sequence gets emitted as a single op code, that the %23 bounds check gets done before the above compound instruction, that the redundant branch to the line following gets elided away, and that the call to llvm.lifetime gets elided away. If some of that optimization is not happening, it could explain the slowness; the lifetime call is likely due to the placement of the alloca in the source for this, as you explained.
There are no alloca statements in this at all???
It does answer the question about whether List.update and the update function get inlined, as they obviously are.
As I said before, I would really like to see the assembly code that this is producing! If this is the code after optimization, I can definitely see some potential problems in the lifetime call, which may cost variable amounts of time depending on the arguments passed into it...
List.update gets inlined, but List.get called by List.update does not get inlined.
The allocas are in List.get
Also, I think you don't have the full inner loop. Just part of it. I can try and grab the full thing later for reference.
@Brendan Hansknecht,
List.update gets inlined, but List.get called by List.update does not get inlined. The allocas are in List.get
Okay, thanks, I understand.
Also, I think you don't have the full inner loop. Just part of it. I can try and grab the full thing later for reference.
The snippet is what I got from primes-bench.ll, but I guess I probably don't completely understand how the whole inner loop gets put together, as I don't see where the index gets advanced per loop...
I think this is the code for the innermost loop of the cull function with current roc.
You can see the last 2 lines checking the >= 16384 conditional and then branching back to the first block at the top here.
loop body
Plus this cause it doesn't get inlined:
List.get
Also, man, counting the exact number of .i suffixes is a pain.
your editor can tell you how many characters the current selection is, right?
still not ideal of course
@Brendan Hansknecht,
I think this is the code for the innermost loop of the cull function with current roc.
Thanks for that.
Plus this (List.get - GBG) cause it doesn't get inlined:
That's pretty serious!
You can see the last 2 lines checking the >= 16384 conditional and then branching back to the first block at the top here.
Yes, I identified the end of the loop correctly, but
Also, man, counting the exact number of .i suffixes is a pain.
I miscounted the number of suffixes for the branch back to the beginning of the loop.
List.get is kinda ridiculous cause we don't have borrow inference. It should just be a size check, grabbing the element, and returning a result.
Due to the lack of borrowing, it also has to decrement the refcount and maybe free the entire list. So a significant chunk of the code is spent handling refcounting- and deallocation-related things.
With borrowing, List.get becomes this, which is much more reasonable. It still has the messed-up allocas that stop inlining and some other llvm optimizations, but it is a reasonably sized function now.
List.get with borrowing (aka no refcounting decrement)
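The collapsed snippet isn't reproduced here, but as a hedged C-level approximation of the end state, a borrowed List.get should boil down to just a length check, a load, and a tagged result, with no refcount traffic at all:
#include <cstddef>
#include <cstdint>
#include <optional>

// Illustrative stand-in for a borrowed List.get: bounds check, load,
// tagged result. No refcount increments, decrements, or deallocation.
std::optional<std::uint8_t> list_get(const std::uint8_t* elems,
                                     std::size_t len, std::size_t index) {
    if (index >= len)
        return std::nullopt; // Err OutOfBounds
    return elems[index];     // Ok element
}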
For completeness, after fixing the alloca issue as well and rerunning the llvm passes, the entire loop turns into this:
the expected ir
I do notice this in the part of LLVM's inlining analysis (Analysis/InlineCost.cpp) that looks at allocas:
// FIXME: This is overly conservative. Dynamic allocas are inefficient for
// a variety of reasons, and so we would like to not inline them into
// functions which don't currently have a dynamic alloca. This simply
// disables inlining altogether in the presence of a dynamic alloca.
if (!I.isStaticAlloca())
  HasDynamicAlloca = true;
I might add a flag to this pass / disable it locally and rerun opt to see if that helps. Given that there's a FIXME right above this snippet, I bet we could upstream a patch adding a flag to tune this behavior while keeping the default the same.
That would be really awesome. Though we should still fix our alloca generation. In their guide for frontend authors, llvm is very clear that allocas outside of the entry block mess with a number of optimizations: https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas
In our specific case, even if you move the two allocas to the entry block, another very important piece for performance is merging the two allocas into a single alloca. I assume llvm is unwilling to reason about a phi node that takes two allocas (probably due to aliasing rules). As such, it misses the many follow-on optimizations that writing the output directly would enable.
@Brendan Hansknecht,
the entire loop turns into this: ...
That looks almost exactly like the C loop, other than that the getelementptr may still be doing a bounds check; it should run about the same as, or at least not much slower than, the C code.
No bounds check in the loop.
It is: increment the element offset, calculate the memory address relative to the list pointer, load the element, bitwise-or it, store it back, length check, loop back to the beginning.
So compared to c, it should be one extra instruction. C is updating the pointer directly; Roc is updating an index and then calculating the correct pointer. So it should be one extra lea instruction.
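A hedged side-by-side of the two loop shapes (names are illustrative):
#include <cstddef>
#include <cstdint>

// C style: bump the pointer itself; the or-instruction addresses *p
// directly (assumes buf points at a 16384-byte buffer).
void cull_ptr(std::uint8_t* buf, std::size_t start, std::size_t step,
              std::uint8_t mask) {
    for (std::uint8_t* p = buf + start; p < buf + 16384; p += step)
        *p |= mask;
}

// Roc style: advance an index and recompute buf + c each iteration; on
// x86 that address computation either folds into the addressing mode or
// costs one extra lea.
void cull_idx(std::uint8_t* buf, std::size_t start, std::size_t step,
              std::uint8_t mask) {
    for (std::size_t c = start; c < 16384; c += step)
        buf[c] |= mask;
}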
@Brendan Hansknecht,
So compared to c, should be one extra ... lea instruction.
That still increases the number of CPU clock cycles for this tight loop, about 1.5 cycles in C, by at least 0.25 cycles (if some care is taken with the indexing mode; 0.5 cycles for complex indexing, which I don't think will be used here - Zen 4), for an increase of 16.7% to 33.3%.
So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4; a little faster than the roughly 100 milliseconds you got with the hand-applied fixes before on Intel, but I think you've advanced your optimizations a little more since then, assuming that Roc will have automatic bounds check eliding...
by assuming that Roc will have automatic bounds check eliding
Actually, llvm was smart enough to pull the bounds check out of the loop and instead use a single check before the loop. So nothing is needed on the roc side, at least for a direct case like this.
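Roughly what that hoisting amounts to, as a hedged sketch (a real compiler must also preserve where an out-of-bounds access would trap, e.g. by versioning the loop; that detail is elided here):
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Since every index stays below 16384, one length check before the loop
// dominates all the per-iteration checks.
void cull_hoisted(std::vector<std::uint8_t>& cmpsts, std::size_t start,
                  std::size_t step, std::uint8_t mask) {
    if (cmpsts.size() < 16384)
        throw std::out_of_range("cmpsts shorter than the loop bound");
    std::uint8_t* buf = cmpsts.data();
    for (std::size_t c = start; c < 16384; c += step)
        buf[c] |= mask; // provably in bounds: c < 16384 <= cmpsts.size()
}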
So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4
Would depend some on cache timings, right? Cause all the extra instruction time may be hidden by waiting on loading elements from the l1 cache (probably takes about 4 cycles)? Right?
Actually, looking at the C code, it does cmpsts[c], which also should be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.
@Brendan Hansknecht,
Actually, llvm was smart enough to pull the bounds check out of the loop and instead use a single check before the loop.
Wow, LLVM really pulled it off there.
(My timing estimates -GBG) Would depend some on cache timings, right?
Although modifying the L1 cache has a latency of about 4 clock cycles, the throughput is about 1 cycle once the read/modify/write phases have been combined into one instruction. But that doesn't have much to do with uses of the LEA instruction, which is a register-only instruction.
Actually, looking at the C code, it does cmpsts[c] which also should be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.
It depends on the compiler: LLVM tends to use a LEA instruction and a simple addressing mode on the read/modify/write instruction; GCC tends to skip the LEA and combine the indexing in the read/modify/write instruction. The difference in timing isn't much and also depends on the CPU on which the code is run, but often the GCC way is a tiny bit faster.
My C code timing used GCC but I don't have a working computer just now to output the assembler (the -S command line option). It's often fun to compare the output of GCC to that of Clang, which uses LLVM.
Went to godbolt and got some assembly then dumped the roc assembly:
gcc and clang generate essentially the same thing.
gcc:
.L5:
or BYTE PTR [rdi+rax], dl
add rax, rsi
cmp eax, 16383
jle .L5
clang:
.LBB0_6: # Parent Loop BB0_1 Depth=1
or byte ptr [rax + r10], r11b
add r10, rsi
cmp r10, 16384
jb .LBB0_6
In Roc, with the borrowing hack and the alloca fix, we also essentially get the same assembly for the inner loop:
.LBB13_44:
or byte ptr [r13 + rax], r9b
add rax, rsi
cmp rax, 16383
jbe .LBB13_44
All of them optimize to essentially the same 4 instructions (all with a slightly different comparison and jump instruction, which is interesting).
@Brendan Hansknecht,
All of them optimize to essentially the same 4 instructions (all with a slightly different comparison and jump instruction which is interesting)
Thanks for that.
Yes, all of them should produce identical execution times, but the differences are interesting.
Had to look it up cause I never remember: JB is the unsigned version and JL is the signed version.
Then of course the E suffix adds "or equal".
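A tiny C++ pair that makes the distinction easy to verify on godbolt (illustrative):
#include <cstdint>

// The signedness of the comparison picks the mnemonic: unsigned compares
// use b/be ("below"), signed compares use l/le ("less").
bool below(std::uint64_t a, std::uint64_t b) { return a < b; } // cmp; setb
bool less(std::int64_t a, std::int64_t b)    { return a < b; } // cmp; setl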
@Brendan Hansknecht,
I see you've made some commits preparing for a solution to the low performance on this type of low-level code, but as of the current nightlies the code is still performing slowly. Any estimate of when this might be resolved?
My commits were just hacks to prove out the concepts required. The real changes are larger projects. I sadly have not had the time/energy to tackle them. So mostly they are just documented and waiting.