I watched Richard Feldman's Roc at Handmade Seattle talk, and I have a question about integer overflows. This is (or at least was back when the talk was given) the kind of code the Roc compiler generates for an overflow-checked integer addition:
add rdi, rsi
jo .LBB1_2
According to the video, the problem with the above code is not so much the presence of one extra machine instruction, but rather the fact that that instruction is a jump, which makes it more difficult for the LLVM compiler to optimize the code.
Now, what I'm going to ask may be the most stupid question ever, but I don't know enough to tell, so I'll dare ask anyway: in a pure functional language without recoverable exceptions, like Roc, do we need to check that overflow right away?
Let's suppose we can permanently spare one bit in one of the CPU registers. Let's call that bit STICKY_OVERFLOW_BIT. We could initialize it to 0 at startup, and then do this after each overflow-checked integer addition:
z = x + y;
STICKY_OVERFLOW_BIT |= ACTUAL_HARDWARE_OVERFLOW_BIT;
If an overflow happens we just record that fact and keep going. From that moment on, the memory used by Roc code (as opposed to the memory used by the platform for its own purposes, which will still be fine) will contain garbage, but it seems to me that the problem will remain contained for as long as we avoid I/O. So maybe all we need to do is check the STICKY_OVERFLOW_BIT before performing the next I/O operation? Or maybe delay that check until we return from the function, since the return instruction is going to confuse the optimizer anyway?
Another possibility would be to check before returning from the current function and also before making a call into any other function. That way we would preserve the correct stack trace (not sure how useful that would be in a release build, but still...) and we wouldn't need to permanently allocate any CPU register space for the sticky overflow bit, since the decision of what (if any) register to use for that purpose could be made on a function-by-function basis.
An even more minimalist possibility would be to delay the check only when the operations we need to check are in a tight loop, or when there are several of them in a row. Take this code, for example:
sum = 0;
for (int i = 0; i < n; i++) {
    int x = xs[i];
    sum += x;
}
On an Intel processor, the carry flag is bit 0 of the EFLAGS register (that's the unsigned-overflow indicator; signed overflow is the OF flag, bit 11, which is what jo tests). Let's say there's a spare CPU register at that particular location in the code and let's call it STICKY_EFLAGS. The above code could be checked like this:
sum = 0;
STICKY_EFLAGS = 0;
for (int i = 0; i < n; i++) {
    int x = xs[i];
    sum += x;
    STICKY_EFLAGS |= EFLAGS;
}
if (STICKY_EFLAGS & 1)   /* or & (1 << 11) for signed overflow */
    panic();
The ideal solution of course would be to have a sticky overflow bit that is implemented in hardware. So I did a quick search, and it looks like maybe the ARM processor does have something like that? It's called the Q flag, and it comes with its own QADD, QSUB, ... instructions. I wasn't able to find anything similar for Intel processors.
Could that possibly work? Would it provide any benefits? I've been wondering since I watched that talk.
So a few pieces here. I am trying to figure out the right order to present them.
Let's start with the actual issue with the jo .LBB1_2 instruction. This individual instruction actually has about zero cost. It is a jump that is never taken and should have essentially 100% branch-prediction accuracy. So on any vaguely modern processor, that cost is super low.
The actual cost comes from simply bloating the number of instructions. In order to load the next useful instruction after the add, the CPU now has to fetch one more instruction than it would without the jump. So this can lead to a performance cost, particularly in hot loops.
Side note: the real cost is probably the slot the jump instruction takes up in buffers in the CPU before it is resolved, but it's easier to just view it as the cost of loading an extra instruction.
When it comes to using a STICKY_EFLAGS, there are a number of likely issues. The main ones I see are:
An extra clarification on "which makes it more difficult for the LLVM compiler to optimize the code":
LLVM is generating optimized code for what we are asking it to do. In order to get more optimized code, we would need to ask it to do something different. For example, adding a surrounding if branch that ensures integer overflow can't happen will lead to LLVM removing all of the jump instructions related to integer overflow.
Thanks for the explanation. I see now it was indeed a stupid question, but as I said, I didn't know enough to figure that out on my own.
It's not stupid, the question is totally valid. There is even a decent chance that implementing the STICKY_EFLAGS version could be faster for some programs. A branch definitely has more weird constraints in a CPU than other instructions. It is most likely the case that they would perform almost identically for most programs. The big thing that makes STICKY_EFLAGS unlikely to be faster is just that our branch should never be taken. If instead this was a branch that was taken even 1% of the time, STICKY_EFLAGS could be significantly faster.
Last updated: Jul 05 2025 at 12:14 UTC