There have been some performance benchmarks written showing that Roc is "in there" when running a simple QuickSort compared to other languages; however, that may not be low level enough to really show comparative performance. I posted an issue showing that Roc is currently over twenty times slower than languages such as C/C++ at truly low-level work. The issue pretty well says it all, but I thought I would sign on here to see if any of you had ideas about what I might not be doing correctly or what might be done about it...
Just a note, there is List.repeat
Also, thanks for posting, will be interesting to figure out what is going on.
@Brendan Hansknecht, thanks for the suggestion on List.repeat; actually, that was what I tried first, but I got a "not yet implemented" message and came up with the kludge. Just tried it and it works fine now. At any rate, it won't affect the benchmark, as creating the buffer is only a tiny part of the overall execution time...
Yeah, I guess it isn't implemented on the dev backend yet, so it doesn't work in the repl.
@GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.
It should also work on the repl on the website.
Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.
We probably need bounds check hoisting.
Every iteration of the innermost loop, we are checking the bound. That is not necessary cause the loop already has a check: if c >= 16384 then cp else
If roc could statically analyze that, it could remove the bounds check and get the perf I posted above (which I got by manually removing the check).
Also, I'd be curious to see how c++ does if you use a std::vector and only access it through checked methods. That would be the equivalent of what roc should be generating.
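For concreteness, a minimal C++ sketch of that comparison (the function and parameter names are illustrative, not from the benchmark source):
#include <cstddef>
#include <cstdint>
#include <vector>

// The same inner culling loop, but every access goes through
// std::vector::at, which range-checks and throws on out-of-bounds --
// roughly the equivalent of Roc's checked list access.
void cull_checked(std::vector<std::uint8_t>& cmpsts, std::size_t start,
                  std::size_t step, std::uint8_t mask) {
    for (std::size_t c = start; c < 16384; c += step)
        cmpsts.at(c) |= mask; // .at() performs the per-access bounds check
}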
Brian Carroll said:
GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.
Yes, we are past that as List.repeat works when optimization is turned on...
@Brendan Hansknecht,
We probably need bounds check hoisting.
Every iteration of the innermost loop, we are checking the bound.
Thirty-five CPU clock cycles per array bounds check is pretty expensive bounds checking. I suspect that the bounds check is also triggering something else, such as not inlining a function or not lifting something, which is why I asked if it is possible to view the assembly output.
Yes, automatic bounds checking elision is something that sophisticated compilers do, but when optimized properly, array bounds checks should only take a CPU clock cycle or less.
Also, I'd be curious to see how c++ does if you use a std::vector and only access it through checked methods. That would be the equivalent of what roc should be generating.
I don't know about C++, but the bounds check in Rust costs about a CPU cycle per check; I'd assume that it would be about the same with C++...
May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.
@Brendan Hansknecht,
Every iteration of the innermost loop, we are checking the bound. That is not necessary cause the loop already has a check: if c >= 16384 then cp else
Most compilers, such as F#/C#, that do bounds check elision require that the loop upper bound be explicitly the length of the array, as in "if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time.
@Luke Boswell,
May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.
Did that, but there's too much code to easily follow, with labels that obscure which part of the source code generated them. When looking at assembly code, I can usually identify the innermost loop by the short loop with an "or" bitwise operation applied to 8-bit operands. I could run the .ll code through LLVM's opt and llc programs to generate the assembly, but I don't know what arguments to apply.
@Brendan Hansknecht,
Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.
What was the clock speed for the above test? And what was the issue? Assuming that the above output is after the issue is fixed, and likely still includes bounds checking?
Bounds checking, especially for Nat indices, is very cheap if the compiler is using registers efficiently, which LLVM-generated code usually does: comparing the index in a register to the array size in another register takes a quarter or a third of a CPU clock cycle, and the branch on the out-of-range condition takes about zero time when it matches the branch prediction, which will normally be to not take the overflow branch.
Even if Int indices are used, it takes just one more branch on register negative, which will still match the prediction of not taking the branch in the normal case.
It may barely be worth all the work of detecting and eliding the bounds check when it isn't needed...
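As a hedged illustration of that cost model (checked_or is a made-up name, not from any of the benchmarks):
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// A single bounds check lowers to one register compare plus a conditional
// branch that the predictor learns is never taken, so the steady-state
// cost per access is close to zero.
std::uint8_t checked_or(std::uint8_t* buf, std::size_t len, std::size_t i,
                        std::uint8_t mask) {
    if (i >= len)      // cmp index, length; jae -> the entire check
        std::abort();  // overflow branch, normally never taken
    return buf[i] |= mask;
}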
@GordonBGood do you have a small piece of code that reproduces the List.repeat TODO? It works just fine in the repl for a number of types, so I'm confused what the issue could be.
the bounds check in Rust costs something about a CPU cycle each check
I don't think there will be a bounds check in the F# or Rust. They are using a statically sized array. As such, it should be trivial to know the size and avoid the bounds check completely.
"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time
Yeah, I noticed that too, which is definitely odd.
What was the clock speed for the above test? And what was the issue? Assuming that the above output is after the issue is fixed, and likely still includes bounds checking?
I removed the bounds check from List.update, which is called on repeat in the innermost loop. That is the only change.
On my machine, timing went from about 1500ms to about 100ms. The posted C test takes about 70ms for reference.
That is on an intel linux machine with an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz. Of course that is base freq. Peak should be around 4GHz, though sustained is closer to 3GHz IIRC.
Ok I have been digging into the IR and here are the two main problems from what I can see:
1. List.get is a regular function, not a low level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and in the function we decrement the refcount (checking if we need to deallocate each time).
2. List.get has its allocas in the wrong place. LLVM is smart enough to realize that inlining that function with allocas in the wrong place would lead to a stack space explosion. As such, it refuses to inline the function unless the allocas are fixed.
So I did some testing to verify. One, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the llvm ir to fix the second issue.
The result: Found 23000 primes to 262146 in 103 milliseconds.
Note: both changes are needed for the perf.
No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
Oh, one extra note on the alloca. Each branch is generating its own output alloca, and then we are using a phi node to pick the one we want. LLVM doesn't seem to know how to optimize that. It is important that each branch is writing to the same alloca.
In fact, that is actually the biggest issue; my line labelled "Just alloca fix" above was still using a phi node.
This is just the alloca fix without the phi node: 200ms
So that still has all of the extra refcounting
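For intuition, a hedged C-level analogy of the phi-of-allocas shape (the real problem lives in the generated LLVM IR; on something this small LLVM's SROA pass would clean it up, but it gives up on the larger generated code):
#include <cstdint>

// Two stack slots selected through a pointer "phi": LLVM must reason
// about which slot the final load reads from, and in larger IR it
// gives up on the follow-on optimizations.
std::uint64_t phi_of_slots(bool cond, std::uint64_t x) {
    std::uint64_t a, b;
    std::uint64_t* out;
    if (cond) { a = x + 1; out = &a; } // each branch writes its own slot
    else      { b = x * 2; out = &b; }
    return *out;                       // "phi" picks which slot to read
}

// One shared slot that both branches store into is the shape LLVM
// expects and optimizes cleanly.
std::uint64_t shared_slot(bool cond, std::uint64_t x) {
    std::uint64_t out;
    if (cond) out = x + 1;
    else      out = x * 2;
    return out;
}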
Sounds like we have some pretty big llvm ir changes to make.
Made an overall tracking bug: #6434
@Folkert de Vries,
do you have a small piece of code the reproduces the List.repeat TODO
I'm sorry, I can no longer reproduce the issue. Others said that this function hadn't yet been added to some dev releases, but I just tested it and it now works in my sample code, so I don't know what the original problem was.
@Brendan Hansknecht,
I don't think there will be a bounds check in the F# or Rust. They are using a statically sized array. As such, it should be trivial to know the size and avoid the bounds check completely.
F# to DotNet will put in the bounds check unless the upper bound in the check is specifically given as cmpsts.Length, as that is what DotNet requires to do the bounds check eliding. Fable to JavaScript has no bounds check because JavaScript doesn't handle this without a lot of overhead. Rust, being the "safe" language that it is, will put in the bounds check, although I'm not sure about bounds check eliding for specific cases; if one wants to guarantee no bounds check for Rust vectors, one needs to use the unsafe get pointer manipulations as far as I know, although things may have changed in newer versions.
"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time
Yeah, I noticed that too, which is definitely odd.
It seems to just mean that you don't have bounds check eliding working yet.
On my machine, timing when from about 1500ms to about 100ms. The posted c test takes about 70ms for reference.
That is very similar to what I see on my machine. When the tests get short, as in 70 or 100 milliseconds, testing on laptop CPUs (as mine basically is) gets a bit unpredictable as to what the boost CPU clock rates will be: depending on the power profile, it may take more time than the length of the test to fully ramp the clock rate up to the full boost rate. I suspect that the C reference is actually running at about 1.5 clock cycles per loop rather than the calculated roughly 1.75 based on maximum boost, which means that, at least on my machine, this test isn't long enough to fully ramp up the clock rate. This is verified by running the test for 10,000 and 100,000 loops and observing that the time is less than proportional to the number of extra loops.
No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
That is very interesting, that both "fixes" need to be applied to get expected behavior.
Sounds like we have some pretty big llvm ir changes to make
Yes, it looks pretty complex!!!
Thanks for your replies and your efforts.
Yeah, one of the annoying things with llvm is that it is made for c/c++ like code. With that, it expects the IR to be in specific forms or it will give up on many optimizations. Since our IR is not what llvm expects, we are losing optimizations and getting overall poor perf.
That said, it has gotten more general over the years as more languages target it
I edited the Roc code in the OP of the issue to clean up the code a bit and to use List.repeat instead of the kludge. I also nested the closures since, if nesting closures turns out to be the problem, it needs to be fixed...
:+1:
@Brendan Hansknecht,
Yeah, one of the annoying things with llvm is that it is made for c/c++ like code. With that, it expects the IR to be in specific forms or it will give up on many optimizations. Since our IR is not what llvm expects, we are losing optimizations and getting overall poor perf.
Well, GHC Haskell also has an optional LLVM back-end and that works well enough; in fact, most times it produces better code than GHC's Native Code Generator, especially with regard to register allocation.
I don't know much about the Roc code base, but I wonder if you are getting too deep into the complexity of handling all the variations through Rust imperative-language details instead of stepping back and looking at how it would usually be done functionally; Unique/Linear Types have been around for quite a long time and are implemented in many languages. I wonder if using graph data-flow analysis to generate the AST with uniqueness flags already attached might save some grief in the back-end code generation...
All your references to alloca manipulations would seem to be a concern as to reliable and easy-to-maintain code, but that is just my two cents...
Well, GHC Haskell also has an optional LLVM back-end and that works well enough
For sure, but that is because their backend has been written in a way to generate the type of ir that llvm expects. In general, the closer to a c/c++-style ir you generate, the more perf you will get out of llvm. Our backend will do the same over time, but currently we have cases where we do things that don't play nicely with the llvm optimizations. Like this.
All your references to alloca manipulations would seem to be a concern as to reliable and easy-to-maintain code, but that is just my two cents...
I am sure that GHC and all other llvm backends do the type of alloca handling that I mentioned above. It is essentially a requirement to get llvm to optimize your code well. alloca is just part of life when generating llvm ir.
That said, this ties into what I meant when I said that llvm is made for c/c++ like code. A functional-language-specific llvm pass that makes different assumptions about pointer and aliasing rules could totally optimize these alloca instructions correctly.
llvm sadly assumes that all allocas will always be at the beginning of a function (that is what c does after all). It is smart enough to generally do hoisting correctly when inlining, but it can generate very broken code when combining some of these assumptions with tail calls. llvm also assumes that essentially all pointers can alias. This ruins tons of optimizations that would be obvious in other contexts.
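As a hedged illustration of the aliasing point (illustrative names; __restrict is a widely supported compiler extension, not standard C++):
#include <cstddef>
#include <cstdint>

// With no noalias information, the compiler must assume the store
// through out could modify *len, so *len is reloaded on every iteration
// instead of being hoisted out of the loop.
void zero_fill(std::uint8_t* out, const std::size_t* len) {
    for (std::size_t i = 0; i < *len; ++i)
        out[i] = 0;
}

// Declaring the pointers non-aliasing lets the load of *len be hoisted
// and the loop optimized freely.
void zero_fill_noalias(std::uint8_t* __restrict out,
                       const std::size_t* __restrict len) {
    for (std::size_t i = 0; i < *len; ++i)
        out[i] = 0;
}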
Anyway, all that to say our current backend is not generating what llvm expects and we need to change that
@Brendan Hansknecht,
Thank you for your explanation - it looks like you have a long row to hoe... ;-)
@Brendan Hansknecht,
I did "--emit-llvm-ir" on the optimized example and the innermost loop looks like the following:
else_block.i.i.i.i.i.i.i.i.i.i: ; preds = %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
%load_element.i.i.i.i.i.i.i.i.i.i = load i8, ptr %get_opaque_data_ptr.i.i.i.i.i.i.i.i.i.i, align 8
%int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i = or i8 %load_element.i.i.i.i.i.i.i.i.i.i, %int_shift_left.i.i.i.i.i.i.i.i.i
%23 = getelementptr inbounds i8, ptr %4, i64 %joinpointarg.i.i.i.i.i.i.i.i.i93
store i8 %int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i, ptr %23, align 1
br label %List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i
List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i: ; preds = %else_block.i.i.i.i.i.i.i.i.i.i, %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
call void @llvm.lifetime.end.p0(i64 2, ptr nonnull %result_value.i.i.i.i.i.i.i.i.i.i)
%gte_uint.i.i.i.i.i.i.i.i.i.i = icmp ugt i64 %operation_result.i.i.i.i.i.i.i.i.i.i, 16383
br i1 %gte_uint.i.i.i.i.i.i.i.i.i.i, label %joinpointcont.i.i.i.i.i.i.i.i.loopexit, label %else_block.i.i.i.i.i.i.i.i.i
It actually looks like it would be possible for this to be optimized into some reasonable assembly code, assuming that the load/modify/write sequence gets emitted as a single op code, that the %23 bounds check gets done before the above compound instruction, that the redundant branch to the line following gets elided away, and that the call to llvm.lifetime gets elided away. If some of that optimization is not happening, it could explain the slowness; the lifetime call is likely due to the placement of the alloca in the source for this, as you explained.
There are no alloca statements in this at all???
It does answer the question about whether List.update and the update function get inlined, as they obviously are.
As I said before, I would really like to see the assembly code that this is producing! If this is the code after optimization, I can definitely see some potential problems in the lifetime call, which may cost variable amounts of time depending on the arguments passed into it...
List.update gets inlined, but List.get called by List.update does not get inlined.
The allocas are in List.get
Also, I think you don't have the full inner loop. Just part of it. I can try and grab the full thing later for reference.
@Brendan Hansknecht,
List.update gets inlined, but List.get called by List.update does not get inlined. The allocas are in List.get
Okay, thanks, I understand.
Also, I think you don't have the full inner loop. Just part of it. I can try and grab the full thing later for reference.
The snippet is what I got from primes-bench.ll, but I guess I probably don't completely understand how the whole inner loop gets put together, as I don't see where the index gets advanced per loop...
I think this is the code for the innermost loop of the cull function with current roc.
You can see the last 2 lines checking the >= 16384 conditional and then branching back to the first block at the top here.
loop body
Plus this cause it doesn't get inlined:
List.get
Also, man, counting the exact number of .i suffixes is a pain.
your editor can tell you how many characters the current selection is, right?
still not ideal of course
@Brendan Hansknecht,
I think this is the code for the innermost loop of the cull function with current roc.
Thanks for that.
Plus this (List.get - GBG) cause it doesn't get inlined:
That's pretty serious!
You can see the last 2 lines checking the >= 16384 conditional and then branching back to the first block at the top here.
Yes, I identified the end of the loop correctly, but
Also, man, counting the exact number of .i suffixes is a pain.
I miscounted the number of suffixes for the branch back to the beginning of the loop.
List.get is kinda ridiculous cause we don't have borrow inference. It should just be a size check, grabbing the element, and returning a result.
Due to the lack of borrowing, it also has to decrement the refcount and maybe free the entire list. So a significant chunk of the code is spent handling refcounting- and deallocation-related things.
With borrowing, List.get becomes this, which is much more reasonable. It still has the messed-up allocas that stop inlining and some other llvm optimizations, but it is a reasonably sized function now.
List.get with borrowing (aka no refcounting decrement)
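The collapsed snippet isn't reproduced here, but as a hedged C-level approximation of the end state, a borrowed List.get should boil down to just a length check, a load, and a tagged result, with no refcount traffic at all:
#include <cstddef>
#include <cstdint>
#include <optional>

// Illustrative stand-in for a borrowed List.get: bounds check, load,
// tagged result. No refcount increments, decrements, or deallocation.
std::optional<std::uint8_t> list_get(const std::uint8_t* elems,
                                     std::size_t len, std::size_t index) {
    if (index >= len)
        return std::nullopt; // Err OutOfBounds
    return elems[index];     // Ok element
}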
For completeness, after fixing the alloca issue as well and rerunning the llvm passes, the entire loop turns into this:
the expected ir
I do notice this in the part of LLVM's inlining analysis (Analysis/InlineCost.cpp) that looks at allocas:
// FIXME: This is overly conservative. Dynamic allocas are inefficient for
// a variety of reasons, and so we would like to not inline them into
// functions which don't currently have a dynamic alloca. This simply
// disables inlining altogether in the presence of a dynamic alloca.
if (!I.isStaticAlloca())
  HasDynamicAlloca = true;
I might add a flag to this pass / disable it locally and rerun opt to see if that helps. Given that there's a FIXME right above this snippet, I bet we could upstream a patch adding a flag to tune this behavior while keeping the default the same.
That would be really awesome. Though we should still fix our alloca generation. In their guide for frontend authors, llvm is very clear that allocas outside of the entry block mess with a number of optimizations: https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas
In our specific case, even if you move the two allocas to the entry block, another very important piece for performance is merging the two allocas into a single alloca. I assume llvm is unwilling to reason about a phi node that takes two allocas (probably due to aliasing rules). As such, it misses the many follow-on optimizations that writing the output directly would enable.
@Brendan Hansknecht,
the entire loop turns into this: ...
That looks almost exactly like the C loop, other than that the getelementptr may still be doing a bounds check; it should run about the same as, or at least not much slower than, the C code.
No bounds check in the loop.
It is: increment the element offset, calculate the memory address relative to the list pointer, load the element, bitwise-or it, store it back, length check, loop back to the beginning.
So compared to c, it should be one extra instruction. C is updating the pointer directly; Roc is updating an index and then calculating the correct pointer. So it should be one extra lea instruction.
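A hedged side-by-side of the two loop shapes (names are illustrative):
#include <cstddef>
#include <cstdint>

// C style: bump the pointer itself; the or-instruction addresses *p
// directly (assumes buf points at a 16384-byte buffer).
void cull_ptr(std::uint8_t* buf, std::size_t start, std::size_t step,
              std::uint8_t mask) {
    for (std::uint8_t* p = buf + start; p < buf + 16384; p += step)
        *p |= mask;
}

// Roc style: advance an index and recompute buf + c each iteration; on
// x86 that address computation either folds into the addressing mode or
// costs one extra lea.
void cull_idx(std::uint8_t* buf, std::size_t start, std::size_t step,
              std::uint8_t mask) {
    for (std::size_t c = start; c < 16384; c += step)
        buf[c] |= mask;
}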
@Brendan Hansknecht,
So compared to c, should be one extra ... lea instruction.
That still increases the number of CPU clock cycles for this tight loop, about 1.5 cycles in C, by at least 0.25 cycles (if some care is taken with the indexing mode; 0.5 cycles for complex indexing, which I don't think will be used here - Zen 4), for an increase of 16.7% to 33.3%.
So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4; a little faster than the roughly 100 milliseconds you got with the hand-applied fixes before on Intel, but I think you've advanced your optimizations a little more since then, assuming that Roc will have automatic bounds check eliding...
by assuming that Roc will have automatic bounds check eliding
Actually, llvm was smart enough to pull the bounds check out of the loop and instead use a single check before the loop. So nothing is needed on the roc side, at least for a direct case like this.
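Roughly what that hoisting amounts to, as a hedged sketch (a real compiler must also preserve where an out-of-bounds access would trap, e.g. by versioning the loop; that detail is elided here):
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Since every index stays below 16384, one length check before the loop
// dominates all the per-iteration checks.
void cull_hoisted(std::vector<std::uint8_t>& cmpsts, std::size_t start,
                  std::size_t step, std::uint8_t mask) {
    if (cmpsts.size() < 16384)
        throw std::out_of_range("cmpsts shorter than the loop bound");
    std::uint8_t* buf = cmpsts.data();
    for (std::size_t c = start; c < 16384; c += step)
        buf[c] |= mask; // provably in bounds: c < 16384 <= cmpsts.size()
}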
So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4
Would depend some on cache timings, right? Cause all the extra instruction time may be hidden by waiting on loading elements from the l1 cache (probably takes about 4 cycles)? Right?
Actually, looking at the C code, it does cmpsts[c], which also should be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.
@Brendan Hansknecht,
Actually, llvm was smart enough to pull the bounds check out of the loop and instead use a single check before the loop.
Wow, LLVM really pulled it off there.
(My timing estimates -GBG) Would depend some on cache timings, right?
Although modifying the L1 cache has a latency of about 4 clock cycles, the throughput is about 1 cycle once the read/modify/write phases have been combined into one instruction. But that doesn't have much to do with uses of the LEA instruction, which is a register-only instruction.
Actually, looking at the C code, it does cmpsts[c] which also should be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.
It depends on the compiler: LLVM tends to use a LEA instruction and a simple addressing mode on the read/modify/write instruction; GCC tends to skip the LEA and combine the indexing in the read/modify/write instruction. The difference in timing isn't much and also depends on the CPU on which the code is run, but often the GCC way is a tiny bit faster.
My C code timing used GCC but I don't have a working computer just now to output the assembler (the -S command line option). It's often fun to compare the output of GCC to that of Clang, which uses LLVM.
Went to godbolt and got some assembly then dumped the roc assembly:
gcc and clang generate essentially the same thing.
gcc:
.L5:
or BYTE PTR [rdi+rax], dl
add rax, rsi
cmp eax, 16383
jle .L5
clang:
.LBB0_6: # Parent Loop BB0_1 Depth=1
or byte ptr [rax + r10], r11b
add r10, rsi
cmp r10, 16384
jb .LBB0_6
In Roc, with the borrowing hack and the alloca fix, we also essentially get the same assembly for the inner loop:
.LBB13_44:
or byte ptr [r13 + rax], r9b
add rax, rsi
cmp rax, 16383
jbe .LBB13_44
All of them optimize to essentially the same 4 instructions (all with a slightly different comparison and jump instruction, which is interesting).
@Brendan Hansknecht,
All of them optimize to essentially the same 4 instructions (all with a slightly different comparison and jump instruction which is interesting)
Thanks for that.
Yes, all of them should produce identical execution times, but the differences are interesting.
Had to look it up cause I never remember: JB is the unsigned version and JL is the signed version.
Then of course the E suffix adds "or equal".
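A tiny C++ pair that makes the distinction easy to verify on godbolt (illustrative):
#include <cstdint>

// The signedness of the comparison picks the mnemonic: unsigned compares
// use b/be ("below"), signed compares use l/le ("less").
bool below(std::uint64_t a, std::uint64_t b) { return a < b; } // cmp; setb
bool less(std::int64_t a, std::int64_t b)    { return a < b; } // cmp; setl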
@Brendan Hansknecht,
I see you've made some commits preparing for a solution to the low performance on this type of low-level code, but as of the current nightlies the code is still performing slowly. Any estimate of when this might be resolved?
My commits were just hacks to prove out the concepts required. The real changes are larger projects. I sadly have not had the time/energy to tackle them. So mostly they are just documented and waiting.