I've been experimenting with different ways to count the number of backslashes in a large file. I wrote a function which I hope will be branchless (but haven't checked the dissasembly yet) however it is giving me an incorrect answer and I am not sure if I am doing something wrong, or maybe we have a bug somewhere.
The below seems to work ok, the below tests pass etc... however when I run it on a large file of UTF-8 bytes that I know has 1230 backslashes it gives me an incorrect answer of 24_642_180.
Can anyone see anything obviously wrong with this helper?
countSlashesBranchlessHelp : Nat, U8 -> Nat
countSlashesBranchlessHelp = \count, byte ->
byte
|> Num.bitwiseXor '\\' # XOR will be 0 if values are equal
|> Num.subWrap 0b0000_0001 # MSB 1 if values are equal, 0 otherwise
|> Num.shiftRightBy 7u8 # Shift MSB to LSB
|> Num.toNat # Convert to Nat
|> Num.addWrap count # Add to running count
expect List.walk ['1','2','3'] 0 countSlashesBranchlessHelp == 0
expect List.walk ['1','2','\\','3','\\'] 0 countSlashesBranchlessHelp == 2
It could be valuable to progressively half the input file string with a function to see at which point the count starts going wrong.
I think the xor/subtraction logic is incorrect. For example, for <
, which is ascii code 60 0b0111100
, the xor against \=92
will be 0b1011000 xor 0b0111100 = 0b1100100
. The subtraction and right shift will then end up with 1 in the LSB, if I did my math right.
Thank you @Ayaz Hafiz :thank_you:
Update to my previous. I changed it to the below which now gives the expected answer!
countSlashesBranchlessHelp : Nat, U8 -> Nat
countSlashesBranchlessHelp = \count, byte ->
# XOR will be 0 if values are equal
a = Num.bitwiseXor byte '\\'
# Use shifts to propogate 1's
b = Num.bitwiseOr a (Num.shiftRightBy a 4)
c = Num.bitwiseOr b (Num.shiftRightBy b 2)
d = Num.bitwiseOr c (Num.shiftRightBy c 1)
# Invert the bits
e = Num.bitwiseNot d
# Mask for the LSB
f = Num.bitwiseAnd e 1
# Add to running count
f |> Num.toNat |> Num.addWrap count
Unfortunately, though I think it looks nicer, if I write it out using a pipeline it is roughly 25% slower. I assume this is because it is because the lambdas are not being inlined in the function?
countSlashesBranchlessHelp : Nat, U8 -> Nat
countSlashesBranchlessHelp = \count, byte ->
byte
|> Num.bitwiseXor '\\' # XOR will be 0 if values are equal
|> \c -> Num.bitwiseOr c (Num.shiftRightBy c 4) # Use shifts to propogate 1's
|> \c -> Num.bitwiseOr c (Num.shiftRightBy c 2)
|> \c -> Num.bitwiseOr c (Num.shiftRightBy c 1)
|> \c -> Num.bitwiseNot c # Invert the bits
|> \c -> Num.bitwiseAnd c 1 # Mask for the LSB
|> Num.toNat # Convert to Nat
|> Num.addWrap count # Add to running count
I guess my question here is should I expect the pipeline version to run at the same speed? or is that unreasonable?
we'd have to do our own inlining to guarantee it
though LLVM should get pretty close here
we do want to do our own inlining at some point, but it's not a priority in the near term
Llvm not inlining this surprised me. I wonder if something else is wrong
Last updated: Jul 05 2025 at 12:14 UTC