Stream: show and tell

Topic: Low Roc "bit-twiddling" performance...


view this post on Zulip GordonBGood (Jan 26 2024 at 06:37):

There have been some performance benchmarks showing that Roc is "in there" when running a simple QuickSort compared to other languages; however, that may not be low-level enough to really show comparative performance. I posted an issue showing that Roc is currently over twenty times slower than languages such as C/C++ at truly low-level work. The issue pretty well says it all, but I thought I would sign on here to see whether any of you had ideas about what I might not be doing correctly or what might be done about it...

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:06):

Just a note, there is List.repeat

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:06):

Also, thanks for posting, will be interesting to figure out what is going on.

view this post on Zulip GordonBGood (Jan 26 2024 at 07:12):

@Brendan Hansknecht, Thanks for the suggestion on List.repeat; actually, that was what I tried first, got a "not yet implemented" message, and came up with the kludge. Just tried it again and it works fine now. At any rate, it won't affect the benchmark, as creating the buffer is only a tiny part of the overall execution time...

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:12):

Yeah, I guess it isn't implemented on the dev backend yet, so it doesn't work in the repl.

view this post on Zulip Brian Carroll (Jan 26 2024 at 07:34):

@GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.
It should also work on the repl on the website.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:40):

Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.

We probably need bounds check hoisting.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:42):

Every iteration of the innermost loop, we are checking the bound. That is not necessary because the loop already has a check: if c >= 16384 then cp else

If Roc could statically analyze that, it could remove the bounds check and get the perf I posted above (which I got by manually removing the check)

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 07:48):

Also, I'd be curious to see how C++ does if you use a std::vector and only access it through checked methods. That would be the equivalent of what Roc should be generating
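In C terms, a checked accessor like std::vector::at amounts to one compare and a well-predicted branch in front of the load. A hedged sketch; checked_get is a hypothetical helper name, not an API from the thread:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Roughly what a bounds-checked element read lowers to: one compare
 * plus a branch the predictor almost always gets right, then the load. */
bool checked_get(const uint8_t *buf, size_t len, size_t i, uint8_t *out) {
    if (i >= len)        /* the bounds check */
        return false;
    *out = buf[i];
    return true;
}
```

When the check cannot be hoisted or elided, this compare-and-branch runs on every element access, which is the overhead being discussed.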

view this post on Zulip GordonBGood (Jan 26 2024 at 08:01):

Brian Carroll said:

GordonBGood we have three code generation back ends at different stages of development. For what you are doing you should run the compiler from the command line with optimisations on. List.repeat should work there.

Yes, we are past that as List.repeat works when optimization is turned on...

view this post on Zulip GordonBGood (Jan 26 2024 at 08:11):

@Brendan Hansknecht,

We probably need bounds check hoisting.
Every iteration of the innermost loop, we are checking the bound.

Thirty-five CPU clock cycles per array bounds check is pretty expensive bounds checking. I suspect that the bounds check is also triggering something else, such as not inlining a function or not hoisting, which is why I asked whether it is possible to view the assembly output.

Yes, automatic bounds-check elision is something that sophisticated compilers do, but when optimized properly, an array bounds check should only take a CPU clock cycle or less.

Also, id be curious to see how c++ does if you use a std::vector and only access it through checked methods. That would be the equivalent to what roc should be generating

I don't know about C++, but a bounds check in Rust costs about a CPU cycle per check; I'd assume it would be about the same with C++...

view this post on Zulip Luke Boswell (Jan 26 2024 at 08:15):

May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.

view this post on Zulip GordonBGood (Jan 26 2024 at 08:15):

@Brendan Hansknecht,

Every iteration of the innermost loop, we are checking the bound. That is not necessary cause loop already has a check if c >= 16384 then cp else

Most compilers that do bounds-check elision, such as F#/C#, require that the loop upper bound explicitly be the length of the array, as in "if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time.
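The shape these compilers recognize can be pictured in C. The cull helper below is hypothetical, not the code from the issue: because the loop condition compares the index against the buffer's own length, the optimizer can prove buf[c] is always in bounds and drop any per-access check.

```c
#include <stddef.h>
#include <stdint.h>

/* Bounds-check-elision-friendly loop shape: the loop condition itself
 * guarantees c < len, so every buf[c] access is provably in bounds. */
void cull(uint8_t *buf, size_t len, size_t start, size_t stride, uint8_t mask) {
    for (size_t c = start; c < len; c += stride)
        buf[c] |= mask;
}
```

The equivalent Roc attempt described above ("if c >= (cmpsts |> List.len)") is the same idea expressed with an explicit length check.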

view this post on Zulip GordonBGood (Jan 26 2024 at 09:05):

@Luke Boswell,

May or may not be useful, but you can use --emit-llvm-ir if you want to generate a .ll file and see what roc is generating.

Did that, but there is too much code to follow easily, with labels obfuscated as to which part of the source code generated them. When looking at assembly code, I can usually identify the innermost loop as a short loop with an "or" bitwise operation applied to 8-bit operands. I could run the .ll code through LLVM's opt and llc programs to generate the assembly, but I don't know what arguments to apply.
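For reference, one plausible pipeline for this (the flag choices are an assumption, not from the thread; the filename matches the primes-bench.ll mentioned later):

```shell
# Re-run LLVM's optimizer on the emitted IR, keeping textual IR (-S)...
opt -O3 -S primes-bench.ll -o primes-bench.opt.ll
# ...then lower the optimized IR to native assembly with llc.
llc -O3 primes-bench.opt.ll -o primes-bench.s
```

llc also accepts -filetype=asm explicitly (the default) and --march to target a different architecture.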

view this post on Zulip GordonBGood (Jan 26 2024 at 10:07):

@Brendan Hansknecht,

Oh, I found the issues: Found 23000 primes to 262146 in 111 milliseconds.

What was the clock speed for the above test? And what was the issue? I assume the above output is from after the issue was fixed, and likely still includes bounds checking?

Bounds checking, especially for Nat indices, is very cheap if the compiler is using registers efficiently, which LLVM-generated code usually does: comparing the index in one register to the array size in another takes a quarter to a third of a CPU clock cycle, and a check-and-branch on the comparison takes about zero time when it matches the branch prediction, which will normally be to not take the out-of-bounds branch.

Even if Int indices are used, it takes just one more branch-on-register-negative, which will still match the prediction of not taking the branch in the normal case.

It may barely be worth all the work of detecting and eliding the bounds check when it isn't needed...

view this post on Zulip Folkert de Vries (Jan 26 2024 at 15:39):

@GordonBGood do you have a small piece of code that reproduces the List.repeat TODO? It works just fine in the repl for a number of types, so I'm confused what the issue could be

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 16:09):

the bounds check in Rust costs something about a CPU cycle each check

I don't think there will be a bounds check in the F# or Rust. They are using a statically sized array. As such, it should be trivial to know the size and avoid the bounds check completely.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 16:10):

"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time

Yeah, I noticed that too, which is definitely odd.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 16:18):

What was the clock speed for the above test? And what was the issue? Assuming that the above output is after the issue is fixed, and likely still includes bounds checking?

I removed the bounds check from List.update, which is called repeatedly in the innermost loop. That is the only change.

On my machine, timing went from about 1500ms to about 100ms. The posted C test takes about 70ms for reference.

That is on an Intel Linux machine with an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz. Of course, that is the base frequency. Peak should be around 4GHz, though sustained is closer to 3GHz, IIRC.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:25):

Ok I have been digging into the IR and here are the two main problems from what I can see:

  1. List.get is a regular function, not a low-level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and inside the function we decrement the refcount (checking each time whether we need to deallocate).
  2. The same root cause as #6213: having allocas in locations other than the entry block stops LLVM from being able to inline List.get. LLVM is smart enough to realize that inlining a function with allocas in the wrong place could lead to a stack-space explosion, so it refuses to inline the function unless the allocas are fixed.

So I did some testing to verify. First, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the LLVM IR to fix the second issue.
The result: Found 23000 primes to 262146 in 103 milliseconds.

Note: both changes are needed for the perf.

No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:41):

Oh, one extra note on the alloca: each branch is generating its own output alloca, and then we are using a phi node to pick the one we want. LLVM doesn't seem to know how to optimize that. It is important that each branch write to the same alloca.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:43):

In fact, that is actually the biggest issue; my line labelled "Just alloca fix" above was still using a phi node.

This is just alloca fix without the phi node: 200ms

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:43):

So that still has all of the extra refcounting

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:45):

Sounds like we have some pretty big LLVM IR changes to make. We need to:

  1. Ensure all allocas are always in the entry block.
  2. Reuse allocas in different branches that are generating the same result.
  3. Optional, but probably good practice: write to the output pointer instead of making an alloca when we have the choice.
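To illustrate points 1 and 2, here is a hand-written sketch (not Roc's actual output) of the problematic shape versus the shape LLVM optimizes well, in opaque-pointer LLVM IR:

```llvm
; Problematic: each branch makes its own alloca and a phi picks one.
define i8 @get_bad(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else
then:
  %a = alloca i8, align 1             ; alloca outside the entry block
  store i8 1, ptr %a, align 1
  br label %join
else:
  %b = alloca i8, align 1             ; a second alloca for the same result
  store i8 2, ptr %b, align 1
  br label %join
join:
  %out = phi ptr [ %a, %then ], [ %b, %else ]
  %v = load i8, ptr %out, align 1
  ret i8 %v
}

; Preferred: one alloca in the entry block; both branches store to it.
define i8 @get_good(i1 %cond) {
entry:
  %out = alloca i8, align 1
  br i1 %cond, label %then, label %else
then:
  store i8 1, ptr %out, align 1
  br label %join
else:
  store i8 2, ptr %out, align 1
  br label %join
join:
  %v = load i8, ptr %out, align 1
  ret i8 %v
}
```

In the second form, mem2reg can promote %out to a register and the function inlines cleanly; in the first, the phi over two allocas blocks both.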

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 17:53):

Made an overall tracking bug: #6434

view this post on Zulip GordonBGood (Jan 26 2024 at 19:21):

@Folkert de Vries,

do you have a small piece of code the reproduces the List.repeat TODO

I'm sorry, I can no longer reproduce the issue. Others said this function hadn't yet been added to some dev releases, but I just tested and it now works in my sample code, so I don't know what the original problem was.

view this post on Zulip GordonBGood (Jan 26 2024 at 19:40):

@Brendan Hansknecht,

I don't think there will be a bounds check in the F# or Rust. They are using a static sized array. As such, it should be trivial to know the size and avoid the bounds check completely.

F# to DotNet will put in the bounds check unless the upper bound in the check is specifically cmpsts.Length, as that is what DotNet requires for bounds-check eliding. Fable to JavaScript has no bounds check because JavaScript can't handle this without a lot of overhead. Rust, being the "safe" language that it is, will put in the bounds check, although I'm not sure about bounds-check eliding for specific cases; if one wants to guarantee no bounds check for Rust vectors, one needs to use the unsafe pointer-get manipulations as far as I know, although things may have changed with newer versions.

"if c >= (cmpsts |> List.len) then cp else". I tried that with no appreciable difference in execution time

Yeah, I noticed that too, which is definitely odd.

It seems to just mean that you don't have bounds-check eliding working yet.

On my machine, timing when from about 1500ms to about 100ms. The posted c test takes about 70ms for reference.

That is very similar to my machine. When tests get as short as 70 or 100 milliseconds, testing on laptop CPUs (which mine basically is) gets a bit unpredictable as to what the boost clock rate will be: depending on the power profile, it may take more time than the length of the test to fully ramp the clock up to the full boost rate. I suspect that the C reference is actually running at about 1.5 clock cycles per loop rather than the calculated 1.75 based on maximum boost, which means that, at least on my machine, the test isn't long enough to fully ramp up the clock rate. This is verified by running the test for 10,000 and 100,000 loops and observing that the time is less than proportional to the number of extra loops.

No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms

That is very interesting, that both "fixes" need to be applied to get expected behavior.

Sounds like we have some pretty big llvm ir changes to make we need

Yes, it looks pretty complex!!!

Thanks for your replies and your efforts.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 20:14):

Yeah, one of the annoying things with LLVM is that it is made for C/C++-like code. Because of that, it expects the IR to be in specific forms, or it will give up on many optimizations. Since our IR is not what LLVM expects, we are losing optimizations and getting overall poor perf.

view this post on Zulip Brendan Hansknecht (Jan 26 2024 at 20:15):

That said, it has gotten more general over the years as more languages target it

view this post on Zulip GordonBGood (Jan 27 2024 at 02:33):

I edited the Roc code in the OP of the issue to clean it up a bit and to use List.repeat instead of the kludge. I also nested the closures because, if nesting closures turns out to be the problem, then it needs to be fixed...

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 02:38):

:+1:

view this post on Zulip GordonBGood (Jan 27 2024 at 02:45):

@Brendan Hansknecht .

Yeah, one of the annoying things with llvm is that it is made for c/c++ like code. With that, it expects the IR to be in specific forms or it will give up on many optimizations. Since our IR is not as llvm expects, we are hitting loss of optimizations and overall poor perf.

Well, GHC Haskell also has an optional LLVM backend, and that works well enough; in fact, most times it produces better code than GHC's Native Code Generator, especially as regards register allocation.

I don't know much about the Roc code base, but I wonder whether you are getting too deep into the complexity of handling all the variations through Rust imperative-language details instead of stepping back and looking at how it would usually be done functionally, as unique/linear types have been around for quite a long time and are implemented in many languages. I wonder whether using graph data-flow analysis to generate the AST with uniqueness flags already attached might save some grief in the backend code generation...

All your references to alloca manipulations seem like a concern for reliable and easy-to-maintain code, but that is just my two cents...

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 03:18):

Well, GHC Haskell also has an optional LLVM back-end and that works well enough

For sure, but that is because their backend has been written to generate the type of IR that LLVM expects. In general, the closer to a C/C++-style IR you generate, the more perf you will get out of LLVM. Our backend will do the same over time, but currently we have cases where we do things that don't play nicely with the LLVM optimizations. Like this.

All your references to alloca manipulations would seem to be a concern as to reliable and easy to maintain code, but that is just my two cents...

I am sure that GHC and all other LLVM backends do the type of alloca handling that I mentioned above. It is essentially a requirement to get LLVM to optimize your code well. Allocas are just part of life when generating LLVM IR.

That said, this ties into what I meant when I said that LLVM is made for C/C++-like code. A functional-language-specific LLVM pass that makes different assumptions about pointer and aliasing rules could totally optimize these alloca instructions correctly.
LLVM sadly assumes that all allocas will always be at the beginning of a function (that is what C does, after all). It is smart enough to generally do hoisting correctly when inlining, but it can generate very broken code when combining some of these assumptions with tail calls. LLVM also assumes that essentially all pointers can alias. This ruins tons of optimizations that would be obvious in other contexts.

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 03:19):

Anyway, all that to say: our current backend is not generating what LLVM expects, and we need to change that.

view this post on Zulip GordonBGood (Jan 27 2024 at 03:21):

@Brendan Hansknecht,

Thank you for your explanation - it looks like you have a long row to hoe... ;-)

view this post on Zulip GordonBGood (Jan 27 2024 at 06:39):

@Brendan Hansknecht,

I did "--emit-llvm-ir" on the optimized example and the innermost loop looks like the following:

else_block.i.i.i.i.i.i.i.i.i.i:                   ; preds = %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
  %load_element.i.i.i.i.i.i.i.i.i.i = load i8, ptr %get_opaque_data_ptr.i.i.i.i.i.i.i.i.i.i, align 8
  %int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i = or i8 %load_element.i.i.i.i.i.i.i.i.i.i, %int_shift_left.i.i.i.i.i.i.i.i.i
  %23 = getelementptr inbounds i8, ptr %4, i64 %joinpointarg.i.i.i.i.i.i.i.i.i93
  store i8 %int_bitwise_or.i.i.i.i.i.i.i.i.i.i.i.i, ptr %23, align 1
  br label %List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i

List_update_8c3e23999fb4a9a5772421e445946d011124a10dbca352b99c3db8b699f4098.exit.i.i.i.i.i.i.i.i.i: ; preds = %else_block.i.i.i.i.i.i.i.i.i.i, %"#Attr_#inc_1.exit.i.i.i.i.i.i.i.i.i.i"
  call void @llvm.lifetime.end.p0(i64 2, ptr nonnull %result_value.i.i.i.i.i.i.i.i.i.i)
  %gte_uint.i.i.i.i.i.i.i.i.i.i = icmp ugt i64 %operation_result.i.i.i.i.i.i.i.i.i.i, 16383
  br i1 %gte_uint.i.i.i.i.i.i.i.i.i.i, label %joinpointcont.i.i.i.i.i.i.i.i.loopexit, label %else_block.i.i.i.i.i.i.i.i.i

It actually looks like it would be possible for this to be optimized into some reasonable assembly code, assuming that the load/modify/write sequence gets emitted as a single opcode, that the %23 bounds check gets done before the compound instruction above, that the redundant branch to the following line gets elided away, and that the llvm.lifetime call gets elided away. If some of that optimization is not happening, it could explain the slowness, likely due to the placement of the alloca in the source for this, as you explained.

There are no alloca statements in this at all???

It does answer the question of whether List.update and the update function are inlined, as they obviously are.

As I said before, I would really like to see the assembly code that this produces! If this is the code after optimization, I can definitely see some potential problems in the lifetime call, which may cost variable amounts of time depending on the arguments passed into it...

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 07:01):

List.update gets inlined, but List.get called by List.update does not get inlined.

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 07:02):

The allocas are in List.get

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 07:02):

Also, I think you don't have the full inner loop, just part of it. I can try to grab the full thing later for reference.

view this post on Zulip GordonBGood (Jan 27 2024 at 07:14):

@Brendan Hansknecht,

List.update gets inlined, but List.get called by List.update does not get inlined. The allocas are in List.get

Okay, thanks, I understand.

Also, I think you don't have the full inner loop. Just part of it. I can try and grab the full thing later for reference.

The snippet is what I got from primes-bench.ll, but I guess I probably don't completely understand how the whole inner loop gets put together, as I don't see where the index gets advanced per loop...

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 23:03):

I think this is the code for the innermost loop of the cull function with current roc.
You can see the last 2 lines checking the >= 16384 conditional and then branching back to the first block at the top here.

loop body

Plus this cause it doesn't get inlined:

List.get

view this post on Zulip Brendan Hansknecht (Jan 27 2024 at 23:04):

Also, man, counting the exact number of .i suffixes is a pain.

view this post on Zulip Folkert de Vries (Jan 27 2024 at 23:05):

your editor can tell you how many characters the current selection is, right?

view this post on Zulip Folkert de Vries (Jan 27 2024 at 23:05):

still not ideal of course

view this post on Zulip GordonBGood (Jan 28 2024 at 00:02):

@Brendan Hansknecht,

I think this is the code for the innermost loop of the cull function with current roc.

Thanks for that.

Plus this (List.get - GBG) cause it doesn't get inlined:

That's pretty serious!

You can see the last 2 lines checking the >= 16384 conditional and then branch back to the first block at the top here.

Yes, I identified the end of the loop correctly, but

Also, man, counting the exact number of .i suffixes is a pain.

I miscounted the number of suffixes for the branch back to the beginning of the loop.

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 00:07):

List.get is kinda ridiculous because we don't have borrow inference. It should just be a size check, grabbing the element, and returning a result.

Due to the lack of borrowing, it also has to decrement the refcount and maybe free the entire list. So a significant chunk of the code is spent handling refcounting and deallocation-related things
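The overhead can be pictured in C. RcList and the two getters below are hypothetical illustrations, not Roc's actual layout or API: the "owned" version must pay refcount bookkeeping and a possible deallocation on every element read, while the "borrowed" version is just the check and the load.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical refcounted list, for illustration only. */
typedef struct {
    size_t refcount;
    size_t len;
    uint8_t data[];
} RcList;

/* Owned get: the caller handed us a reference, so we must release it,
 * meaning a decrement and a potential free on EVERY element read. */
uint8_t get_owned(RcList *l, size_t i) {
    uint8_t v = (i < l->len) ? l->data[i] : 0;
    if (--l->refcount == 0)
        free(l);
    return v;
}

/* Borrowed get: no ownership transferred; just bounds check + load. */
uint8_t get_borrowed(const RcList *l, size_t i) {
    return (i < l->len) ? l->data[i] : 0;
}
```

With the owned form, the caller also has to increment the refcount before each call (matching the inc/dec pair described earlier in the thread), so a single hot-loop read carries two refcount operations plus a deallocation test.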

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 00:25):

With borrowing, List.get becomes this, which is much more reasonable. It still has the messed-up allocas that stop inlining and some other LLVM optimizations, but it is a reasonably sized function now.

List.get with borrowing (aka no refcounting decrement)

view this post on Zulip Brendan Hansknecht (Jan 28 2024 at 00:35):

For completeness, after fixing the alloca issue as well and rerunning the llvm passes, the entire loop turns into this:

the expected ir

view this post on Zulip Asher Mancinelli (Jan 29 2024 at 03:11):

I do notice this in the part of LLVM's inlining analysis (Analysis/InlineCost.cpp) that looks at allocas:

  // FIXME: This is overly conservative. Dynamic allocas are inefficient for
  // a variety of reasons, and so we would like to not inline them into
  // functions which don't currently have a dynamic alloca. This simply
  // disables inlining altogether in the presence of a dynamic alloca.
  if (!I.isStaticAlloca())
    HasDynamicAlloca = true;

I might add a flag to this pass / disable it locally and rerun opt to see if that helps. Given that there's a FIXME right above this snippet, I bet we could upstream a patch to add a flag to tune this behavior while keeping the default the same.

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 03:18):

That would be really awesome. Though we should still fix our alloca generation: in their guide for frontend authors, LLVM is very clear that allocas outside of the entry block mess with a number of optimizations: https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

In our specific case, even if you move the two allocas to the entry block, another very important piece for performance is merging the two allocas into a single alloca. I assume LLVM is unwilling to reason about a phi node that takes two allocas (probably due to aliasing rules). As such, it won't perform the many follow-on optimizations that become possible by writing the output directly.

view this post on Zulip GordonBGood (Jan 29 2024 at 09:07):

@Brendan Hansknecht,

the entire loop turns into this: ...

That looks almost exactly like the C loop, other than that the get pointer may still be doing a bounds check; it should run about the same as, or at least not much slower than, the C code.

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 20:04):

No bounds check in the loop.

It increments the element offset, calculates the memory address relative to the list pointer, loads the element, bitwise-ORs it, stores it back, does the length check, and loops back to the beginning.

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 20:05):

So compared to C, there should be one extra instruction. C updates the pointer directly; Roc updates an index and then calculates the correct pointer. So there should be one extra lea instruction.

view this post on Zulip GordonBGood (Jan 29 2024 at 21:11):

@Brendan Hansknecht,

So compared to c, should be one extra ... lea instruction.

That still increases this tight loop's cost over C's roughly 1.5 cycles by at least 0.25 cycles (if some care is taken in the indexing mode; 0.5 cycles for complex indexing, which I don't think will be used here - Zen 4), for an increase of 16.7% to 33.3%.

So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4; a little faster than the roughly 100 milliseconds you forced before on Intel, though I think you've since advanced your optimizations a little more by assuming that Roc will have automatic bounds-check eliding...

view this post on Zulip Brendan Hansknecht (Jan 29 2024 at 21:24):

by assuming that Roc will have automatic bounds check eliding

Actually, LLVM was smart enough to pull the bounds check out of the loop and instead use a single check before the loop. So nothing is needed on the Roc side, at least for a direct case like this.

So if C takes 70 milliseconds, Roc with all of your optimizations will take either 82 or 93 milliseconds on Zen 4

That would depend some on cache timings, right? Because all the extra instruction time may be hidden by waiting on loads from the L1 cache (which probably takes about 4 cycles)? Right?

Actually, looking at the C code, it does cmpsts[c], which should also be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.

view this post on Zulip GordonBGood (Jan 30 2024 at 00:26):

@Brendan Hansknecht,

Actually, llvm was smart enough to pull the bounds check out of the loop and instead use a single check before the loop.

Wow, LLVM really pulled it off there.

(My timing estimates -GBG) Would depend some on cache timings, right?

Although modifying the L1 cache has a latency of about 4 clock cycles, the throughput is about 1 cycle once the read/modify/write phases have been combined into one instruction; but that doesn't have much to do with use of the LEA instruction, which is a register instruction.

Actually, looking at the C code, it does cmpsts[c] which also should be a lea instruction. So I would expect it to generate the exact same assembly for the innermost loop.

It depends on the compiler: LLVM tends to use a LEA instruction plus a simple addressing mode on the read/modify/write instruction; GCC tends to skip the LEA and fold the indexing into the read/modify/write instruction. The difference in timing isn't much and also depends on the CPU the code runs on, but the GCC way is often a tiny bit faster.

My C code timing used GCC but I don't have a working computer just now to output the assembler (the -S command line option). It's often fun to compare the output of GCC to that of Clang, which uses LLVM.

view this post on Zulip Brendan Hansknecht (Jan 30 2024 at 00:44):

Went to godbolt and got some assembly, then dumped the Roc assembly:

gcc and clang generate essentially the same thing.
gcc:

.L5:
        or      BYTE PTR [rdi+rax], dl
        add     rax, rsi
        cmp     eax, 16383
        jle     .L5

clang:

.LBB0_6:                                #   Parent Loop BB0_1 Depth=1
        or      byte ptr [rax + r10], r11b
        add     r10, rsi
        cmp     r10, 16384
        jb      .LBB0_6

In Roc, with the borrowing hack and the alloca fix, we also get essentially the same assembly for the inner loop:

.LBB13_44:
    or  byte ptr [r13 + rax], r9b
    add rax, rsi
    cmp rax, 16383
    jbe .LBB13_44

view this post on Zulip Brendan Hansknecht (Jan 30 2024 at 00:45):

All of them optimize to essentially the same 4 instructions (each with a slightly different comparison and jump instruction, which is interesting)

view this post on Zulip GordonBGood (Jan 30 2024 at 01:26):

@Brendan Hansknecht,

All of them optimize to essentially the same 4 instructions (all with a slightly different comparison and jump instruction which is interesting)

Thanks for that.

Yes, all of them should produce identical execution times, but the differences are interesting.

view this post on Zulip Brendan Hansknecht (Jan 30 2024 at 02:06):

Had to look it up because I never remember: JB is the unsigned version and JL is the signed version.

Then of course the E suffix is for or equal.

view this post on Zulip GordonBGood (Mar 15 2024 at 23:14):

@Brendan Hansknecht,

I see you've made some commits preparing a solution to the low performance for this type of low-level code, but as of the current nightlies the code still performs slowly. Any estimate on when this might be resolved?

view this post on Zulip Brendan Hansknecht (Mar 15 2024 at 23:27):

My commits were just hacks to prove out the concepts required. The real changes are larger projects. I sadly have not had the time/energy to tackle them, so mostly they are just documented and waiting.


Last updated: Jul 06 2025 at 12:14 UTC