Hello everyone :)
I've written a BF interpreter in Roc that works (in that it runs scripts correctly), but it's incredibly slow. I picked the first implementation that I could find online (one written in Go) and benchmarked it against mine; the Go one was about 9000 times faster on a benchmark script. :sweat_smile:
Here's a link to the repo: https://github.com/Hasnep/brainroc; the interpreter is in src/Interpreter.roc. I'm pretty new to FP, so I'm sure my algorithm is very inefficient. It's very recursive, which might be causing some problems; I tried rewriting it using List.walk instead of recursion, but I struggled with how to handle loops.
Any suggestions to optimise or just improve overall code quality would be very helpful, thanks!
A bit tangential (because it's not about FP but a Rust implementation of a BF interpreter/compiler), but maybe it can give some insights. It's a brilliant series of articles (adapted from C++, but still worth reading):
https://rodrigodd.github.io/
Thanks @Kiryl Dziamura, I'm reading the first post now and it's really cleanly explained, I'm glad to see some of the same ideas I've had appearing there. That'll be really useful when I get round to some more fine-grained optimisations, but I think I still have some more fundamental problems with my code at the moment that I want to tackle first :sweat_smile:
My guess is that this function https://github.com/Hasnep/brainroc/blob/f54d9ee71332e36ea4342abb2054bc80fb0aa484/src/Interpreter.roc#L56-L77 is not tail recursive, so the Roc compiler cannot apply the optimization. But I haven't checked.
it should be because we desugar |>
but I'd need to check it too. Another potential problem could be if the input program list is not unique (and hence cloned a bunch)
but would the desugared call bubble up above the when?
after desugaring the |>, the recursive call is in tail position, right?
ah, yes, my bad
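A minimal sketch of the point, with hypothetical names (step, loop, and the record fields are not from the repo): |> is pure syntax sugar, so piping into a recursive call leaves that call in tail position.

# Hypothetical example: `state |> step |> loop` desugars to `loop (step state)`,
# so the recursive call to `loop` is the last expression evaluated, i.e. it is
# in tail position and can be compiled to a jump.
step = \state -> { state & count: state.count + 1, done: state.count + 1 >= 10 }

loop = \state ->
    if state.done then
        state
    else
        state |> step |> loop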
Do you have a link to the original source as well?
Also, when I get the chance, I have some ideas, but want to verify them first.
Logan Lowder has marked this topic as resolved.
Logan Lowder has marked this topic as unresolved.
Whoops, misclick, sorry
Here's the script I used to benchmark the interpreters, run from the root of the brainroc repo:
# Download benchmark script
curl -C - https://raw.githubusercontent.com/fabianishere/brainfuck/master/examples/bench/bench-1.bf --output examples/bench/bench-1.bf --create-dirs
# Build Go interpreter
git clone https://github.com/kgabis/brainfuck-go.git /tmp/bf-go
pushd /tmp/bf-go
go build bf.go
popd
cp /tmp/bf-go/bf ./bf-go
# Build Roc interpreter
roc build --optimize src/main.roc
# Benchmark the two interpreters
SCRIPT=examples/bench/bench-1.bf
hyperfine "./bf-go $SCRIPT" "./src/brainroc $SCRIPT"
And this is the output I get:
Benchmark #1: ./bf-go examples/bench/bench-1.bf
Time (mean ± σ): 1.2 ms ± 0.2 ms [User: 1.1 ms, System: 0.4 ms]
Range (min … max): 0.8 ms … 2.1 ms 1378 runs
Warning: Command took less than 5 ms to complete. Results might be inaccurate.
Benchmark #2: ./src/brainroc examples/bench/bench-1.bf
Time (mean ± σ): 11.128 s ± 0.040 s [User: 11.123 s, System: 0.002 s]
Range (min … max): 11.073 s … 11.182 s 10 runs
Summary
'./bf-go examples/bench/bench-1.bf' ran
9343.95 ± 1669.68 times faster than './src/brainroc examples/bench/bench-1.bf'
So fundamentally, the app is spending all its time in memcpy.
So definitely a uniqueness issue. There is something you would hope never gets copied, but it is getting copied (probably every iteration of run)
Haven't figured out the root cause of the copying yet.
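To make the uniqueness point concrete, here is a hedged sketch (not the interpreter's actual code; incrementCell is a made-up name): Roc only updates a list in place when its refcount is 1, so any extra live reference to the data list at the moment it is updated turns every cell write into a full clone, which is exactly the kind of thing that shows up as memcpy in a profile.

# If `data` is unique (refcount 1), List.update mutates the list in place.
# If anything else still holds a reference to `data`, the whole list has to be
# copied first, so a hot loop of cell updates degenerates into memcpy calls.
incrementCell = \data, dataPtr ->
    List.update data dataPtr (\cell -> Num.addWrap cell 1)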
Thank you @Brendan Hansknecht, that's a very useful place for me to start finding some performance improvements
Hopefully it helps. I tried a few things I thought would help this morning and got quite confused when none of them worked.
Won't have time this weekend to look at it more
I'm looking into this a bit now. Bunch of things going on. First of all, combining all of the state into one record does not play well with our RC mechanism.
inlining the helpers and splitting out some fields brings me down to ~ 4s. Still terrible, but a bit better
record updates (when some fields are reference-counted) are still a problem
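A small sketch of the two shapes being compared, with illustrative names only: with all the state in one record, a record update touches the reference-counted list fields even when only the program counter changes, whereas passing the hot fields as separate arguments lets the untouched lists flow through unchanged.

# One-record shape: per the observation above, this kind of update can emit
# inc/dec on the refcounted fields (the lists) even though only `pc` changed.
stepRecord = \state ->
    { state & pc: state.pc + 1 }

# Split shape: the lists are separate arguments, so a recursive call just
# passes them along without rebuilding a containing record.
runSplit = \pc, instructions, dataPtr, data, output ->
    if pc < List.len instructions then
        runSplit (pc + 1) instructions dataPtr data output
    else
        output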
by working around some increments/decrements with hacks (exposing unsafe functions) I can get it down to 1s ish. Still terrible though
@Brendan Hansknecht how did you figure out that the majority of the time is spent in memcpy? I still suspect that that is going on, but I don't see it in my perf output, and also we're not clearly cloning very long lists (using the zig code we have for that, anyway)
So I am not 100% sure I was correct, because in a release build with the legacy linker the stack got cut off too early for me to be fully sure. With a debug build and the legacy linker, the profile seemed similar but the stack would go deeper. In that case, almost all the time at the deepest stack level was in the new Facebook memcpy assembly code.
So I'm not completely convinced or sure, but it seemed reasonable. That said, given that none of my optimization tests really worked, I am definitely confused and unsure.
ah, I guess I'm working with a platform without debug info, because I see none of that
I think I had to update the platform to get it to compile on my Linux machine.
Also, I had to change the app not to use the args module because it wasn't working on my Linux machine for some reason
I made this change https://github.com/roc-lang/roc/pull/5667. I don't think it really helps here, but I noticed this missed opportunity while looking at the IR
well I mean it brings down memory consumption, but for the runtime it's not a huge improvement
Wow, thanks for looking into this @Folkert de Vries and @Brendan Hansknecht :) I'm not able to contribute directly to the compiler much, so I'm glad my bad code can help improve it indirectly :sweat_smile:
I also had trouble with the argument parsing on Linux, I think the legacy linker messes up the commandline arguments passed to the program so that the binary isn't the first argument.
I started re-writing the interpreter to try to not pass around such a big struct all the time, if I get it working I'll update this thread with the benchmark
Hannes said:
I also had trouble with the argument parsing on Linux, I think the legacy linker messes up the commandline arguments passed to the program so that the binary isn't the first argument.
This is a known issue, but I haven't been able to investigate it deeply yet
See https://github.com/roc-lang/basic-cli/issues/82
I am taking a look at this, and one thing that is significantly different between the two interpreters is that the Go version does not use a recursive datatype for representing the AST nodes, whereas the Roc version does (https://github.com/Hasnep/brainroc/blob/f54d9ee71332e36ea4342abb2054bc80fb0aa484/src/Parser.roc#L3-L10). I suspect that if you can eliminate this, the runtime will drop significantly, since the recursive structure forces a ton of allocations (even with Folkert's change).
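To illustrate the difference (a hedged sketch; neither type below is copied from the repos): a recursive node type makes every loop own a nested list of child nodes, while the Go-style flat encoding keeps one contiguous list of operator/operand pairs in which loop instructions carry a jump target as their operand.

# Recursive shape, roughly like the Roc parser's output: each Loop owns
# another List of nodes, so building and dropping the AST means many small
# refcounted allocations.
Node : [Add, Sub, Left, Right, Output, Input, Loop (List Node)]

# Flat shape, roughly like the Go interpreter: one list of instructions,
# where JumpIfZero / JumpIfNonZero store the index to jump to in `operand`.
Instruction : { operator : [Add, Sub, Left, Right, Output, Input, JumpIfZero, JumpIfNonZero], operand : Nat }
Program : List Instruction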
posting some other observations in #compiler development (public) > bf interpreter analysis
Here is a more direct port of the Go code: https://gist.github.com/folkertdev/ef1970afe9f00048f163f4d71d5f03f7. However, it still takes 4 seconds on my machine.
replacing the lookup with a version of get that does not perform inc/dec halves the execution time
if pc < List.len instructions then
    { operator, operand } = List.getUnsafe instructions pc
replacing the other List.get operations with their unsafe counterparts halves execution time again.
hm, but go does bounds checks too, right? :thinking:
sure, but not inc/dec
got it down to 400ms with more unsafe
this might be a good place to try an idea I once had that seems silly but maybe might help: List.getWrap : List a, Nat -> Result a [ListWasEmpty] - same as List.get, except:
this does not help
the hypothesis is that maybe the length = 0 branch is easier for LLVM to optimize away in some scenarios than the comparison to a variable number
oh you already tried something similar?
the problem is refcount operations, not the bounds check
oh I was responding to this part:
Folkert de Vries said:
replacing the other List.get operations with their unsafe counterparts halves execution time again.
I assumed that was bounds check, but do the unsafe ones also not do RC?
yes, that is the crucial difference. getUnsafe does not do any inc/dec operations on the list
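For context, a hedged sketch of the difference being described (List.getUnsafe is an internal function that was only reachable here via the hacks mentioned above, so this is illustrative rather than normal app code):

# Safe read: returns a Result and, per the discussion above, also emits
# refcount inc/dec operations on the list in this hot path.
readCell = \data, dataPtr ->
    when List.get data dataPtr is
        Ok cell -> cell
        Err _ -> 0

# Unchecked read: the bounds check is written out once by hand, and the read
# itself skips both the Result wrapping and the inc/dec.
readCellUnsafe = \data, dataPtr ->
    if dataPtr < List.len data then
        List.getUnsafe data dataPtr
    else
        0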
I made a second version that uses tail calls instead of a switch https://gist.github.com/folkertdev/3212e3ed7ac124d4b6e9db920cf05099. Sadly it runs into this error in mono
thread '<unnamed>' panicked at 'function symbol `#UserApp.incrementPointer` not in set LambdaSet { set: [], args: [InLayout(U64), InLayout(48), InLayout(U64), InLayout(LIST_U8), InLayout(U64), InLayout(LIST_U8)], ret: InLayout(LIST_U8), representation: InLayout(VOID), full_layout: InLayout(62) }', crates/compiler/mono/src/layout.rs:1576:9
(cc @Ayaz Hafiz )
hmm, this does not work anyway because llvm tail calls are lies
what is extremely disappointing is that this LLVM
define internal fastcc %list.RocList @"#UserApp_incrementPointer_3bbacd33228bca14fe5573efe7278cde33c78fe9028ba98810cff368dece"(i64 %pc, %list.RocList %instructions, i64 %dataPtr, %list.RocList %data, i64 %"108", %list.RocList %output) {
entry:
%call = tail call fastcc i64 @Num_addWrap_c852b6d75d2364d70d094699f8a9cda9129d5310ed82ea45564f47a9(i64 %pc, i64 1)
%call1 = tail call fastcc i64 @Num_addWrap_c852b6d75d2364d70d094699f8a9cda9129d5310ed82ea45564f47a9(i64 %dataPtr, i64 1)
%call2 = tail call fastcc %list.RocList @"#UserApp_next_4afec3b1b615e34b46f852dc4576722a03d82d96cc27deb38d7b350ecaf31"(i64 %call, %list.RocList %instructions, i64 %call1, %list.RocList %data, %list.RocList %output)
ret %list.RocList %call2
}
does not result in an actual tail call
000000000011f110 <#UserApp_incrementPointer_3bbacd33228bca14fe5573efe7278cde33c78fe9028ba98810cff368dece>:
11f110: 55 push %rbp
11f111: 41 57 push %r15
...
11f16a: ff 74 24 68 pushq 0x68(%rsp)
11f16e: e8 8d 0b 00 00 callq 11fd00 <#UserApp_next_4afec3b1b615e34b46f852dc4576722a03d82d96cc27deb38d7b350ecaf31>
11f173: 48 83 c4 38 add $0x38,%rsp
11f177: 5b pop %rbx
11f178: 41 5c pop %r12
11f17a: 41 5d pop %r13
11f17c: 41 5e pop %r14
11f17e: 41 5f pop %r15
11f180: 5d pop %rbp
11f181: c3 retq
11f182: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
11f189: 00 00 00
11f18c: 0f 1f 40 00 nopl 0x0(%rax)
In clang you can force tail calls with musttail. We may need something equivalent to force LLVM to apply the optimization here.
LLVM has the musttail attribute, but in mono we can't really know whether applying it is correct.
Oh, we're doing recursive list refcount updates to all of the nested AST nodes inside of blocks?
in the original code, yes. Not any more in the version that I'm talking about (which without cheats still takes 4 seconds)
wow with optimizations LLVM messes this up completely. it does a bunch of inlining, but cannot turn it into a loop, instead just creating a recursive function that blows the stack very quickly
@Folkert de Vries What happens if you try to make the tail call extremely obvious to LLVM by making the instruction functions State -> State functions (where State == (Nat, List Instruction, Nat, List U8, List U8)) and turning next into something along the lines of

run : State -> List U8
run = \(pc, instructions, dataPtr, data, output) ->
    if pc < List.len instructions then
        nextInstruction = List.getUnsafe instructions pc
        f = lookup nextInstruction.operator
        next_state = f pc instructions dataPtr data nextInstruction.operand output
        run next_state
    else
        output

?
combining the state into one structure is generally not better
Here's the IR for Folkert's version
For the go version
Optimizations Go performs that we don't:
./bf.go:51:17: input does not escape
./bf.go:53:19: make([]uint16, 0) does not escape
./bf.go:73:27: &errors.errorString{...} escapes to heap
./bf.go:85:25: &errors.errorString{...} escapes to heap
./bf.go:90:17: program does not escape
./bf.go:91:14: make([]int16, data_size) escapes to heap
./bf.go:93:27: new(bufio.Reader) does not escape
./bf.go:93:27: make([]byte, bufio.size) escapes to heap
i.e. it is able to allocate some structures on the stack?
yeah. For program and input, if we had that, presumably we could eliminate all RC
also they definitely hoist the bounds check
well, you could eliminate RC, but you'd also need to know that the data always stays unique
which morphic might be able to tell us statically sometimes, but in general the refcount is also crucial for us to allow in-place updates
if it doesn't escape in the function we don't need to bump refcounts on each loop
we do. What if each loop iteration appends the value to some list
(I know it's not needed in this example, but in general RC is not just about memory safety in our case)
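A sketch of the scenario being described, with made-up names: if an iteration stashes the current data list somewhere, for example by appending it to a history, the list is no longer unique, and the refcount is what tells the next update that it has to clone rather than mutate in place.

# `history` keeps a live reference to `data`, so the List.update below sees a
# refcount greater than 1 and must copy `data` before writing. Dropping the
# refcount operations here would silently corrupt `history`, which is why
# escape analysis alone is not enough to remove them.
stepWithHistory = \data, history ->
    newHistory = List.append history data
    newData = List.update data 0 (\cell -> Num.addWrap cell 1)
    (newData, newHistory)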
likewise, to hoist the bounds check, we'd need to know that the list length is a constant
well, that is what i'm saying about escape analysis
my assumption was that go performs escape analysis on a function level
you'd need to do it on the scope level here, and even then you'd need to check that there is no mutation I think?
well in order to see it on a function level, we must also see it on a per-scope level?
anyway, maybe this is a red herring if we are still slower after removing refcounts
yeah, I got that version down to ~400ms. In the original benchmarks go was at 1.2ms. (I did not run that locally though)
Does anyone have an idea for how the musttail attribute is actually added? https://llvm.org/docs/LangRef.html#id328. There is a special function in the LLVM C API to add a normal tail call, but musttail does not really seem to be supported there.
there is also the tailcc calling convention, which may or may not be relevant
lol wth, so there is tailcc, which will perform tail calls when a call is in tail position. But LLVM "optimizes" the code in such a way that my syntactic tail calls are no longer tail calls
and this, I think, is why I need to use musttail, but there does not seem to be a way to do it ...
hold up
are we sure this Go BF interpreter is correct?
when I run it with @Hannes 's original sample program, https://raw.githubusercontent.com/fabianishere/brainfuck/master/examples/bench/bench-1.bf, it seems to spit out incorrect values (the null byte and then byte 13, rather than OK as a correct interpreter would)
when I try it with the hello world program
>++++++++[<+++++++++>-]<.>++++[<+++++++>-]<+.+++++++..+++.>>++++++[<+++++++>-]<+
+.------------.>++++++[<+++++++++>-]<+.<.+++.------.--------.>>>++++[<++++++++>-
]<+.
inlined into both programs, here are the results:
❯ hyperfine ./brainfuck-go/bf ./brainroc
Benchmark 1: ./brainfuck-go/bf
Time (mean ± σ): 7.1 ms ± 0.9 ms [User: 2.9 ms, System: 2.0 ms]
Range (min … max): 5.6 ms … 9.8 ms 253 runs
Benchmark 2: ./brainroc
Time (mean ± σ): 4.3 ms ± 0.7 ms [User: 2.4 ms, System: 1.0 ms]
Range (min … max): 3.3 ms … 8.1 ms 293 runs
Wow @Ayaz Hafiz you're right! I tried the Go interpreter with the hello world program and some others like Sierpinski's triangle and it worked fine for all of them, but when I ran that benchmark script hyperfine suppressed the output, so I didn't see that it was producing incorrect output!
Here are some benchmarks where the two interpreters do actually produce the same output.
Hello world:
COMMAND │ MEAN [MS] │ MIN [MS] │ MAX [MS] │ RELATIVE
───────────────────────────────────┼───────────┼──────────┼──────────┼──────────────
./bf-go examples/hello.bf │ 1.2 ± 0.1 │ 0.8 │ 1.8 │ 1.00
./src/brainroc examples/hello.bf │ 1.3 ± 0.1 │ 1.2 │ 2.5 │ 1.12 ± 0.15
Sierpinski's triangle:
COMMAND │ MEAN [MS] │ MIN [MS] │ MAX [MS] │ RELATIVE
─────────────────────────────────┼───────────┼──────────┼──────────┼──────────────
./bf-go triangle.bf │ 3.0 ± 0.3 │ 2.7 │ 4.7 │ 1.00
./src/brainroc triangle.bf │ 9.1 ± 0.3 │ 8.7 │ 10.2 │ 3.07 ± 0.29
Benchmark 2:
COMMAND │ MEAN [S] │ MIN [S] │ MAX [S] │ RELATIVE
────────────────────────────┼────────────────┼─────────┼─────────┼───────────────
./bf-go bench-2.bf │ 1.364 ± 0.026 │ 1.348 │ 1.413 │ 1.00
./src/brainroc bench-2.bf │ 54.819 ± 0.135 │ 54.579 │ 55.057 │ 40.18 ± 0.76
You may want to compare to another interpreter that works in all cases. Even if these produce the same results, the roc version may be doing more work/tracking more things. So the go version may have a structural advantage that leads to much faster execution. If it was handling whatever case it is missing that fails bench 1, it might slow down all other cases (not sure of this, but it is a pretty common benchmarking issue)
Good point, I picked this Go implementation because it's the first straightforward implementation I found which is more comparable to mine than some of the bigger ones. I'll try a few and see how they compare, making sure to check the output is correct this time! :sweat_smile:
alright, interesting. Here is my time for my "vanilla" interpreter for the bench-2 benchmark
Executed in 16.75 secs fish external
usr time 16.75 secs 598.00 micros 16.75 secs
sys time 0.00 secs 51.00 micros 0.00 secs
and the time with unsafe cheats
Executed in 1.27 secs fish external
usr time 1271.84 millis 571.00 micros 1271.27 millis
sys time 0.07 millis 66.00 micros 0.00 millis
so that would mean that without the overhead of the list operations, we're in the same order of magnitude at least
I wrote my own version of a BF interpreter.
On one hand, to dust off my own Roc skills, since it's been a while since I programmed in Roc.
On the other, because I've become a liiitle bit nerdsniped by the issue we are facing :sweat_smile:
https://github.com/Qqwy/roc-bf_example
This implementation splits the interpreter into two parts:
- runOne, which performs a single instruction's interpreter State -> State transformation.
- run, which wraps it and performs the desired recursion to run a program to completion.
(And besides this, it is very similar to Folkert's vanilla implementation, with no unsafe cheats.)
It seemed like a good idea to be able to look at these two parts in isolation.
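The split described above looks roughly like this (a sketch only; State, isFinished, and runOne's body are placeholders, not the code from the linked repo):

# run drives the interpreter to completion by repeatedly applying runOne,
# which performs a single instruction's State -> State transformation.
run : State -> State
run = \state ->
    if isFinished state then
        state
    else
        run (runOne state)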
However... it results in a program that segfaults on larger BF programs (?!)
I've tried adding dbg statements, so the intermediate state after each executed instruction can be seen.
Oddly, running roc dev stops the program after 249 dbg lines are printed.
I'm very confused by this. It seems like the crash during dbg happens much earlier than the segfault if you don't run in dev mode.
How do I continue debugging something like this?
Could it be the case that here, too, something that is a tail call in the source does not actually result in a tail call in the binary?
I have definitely seen the dbg issue where, after some number of dbgs, it crashes
Tends to vary a bit by system, but no idea why it happens. Dbg is over shared memory with the compiler, so we probably have a bug there
Brendan Hansknecht said:
I have definitely seen the dbg issue where, after some number of dbgs, it crashes
I'm running it in lldb now. Seems like a call to roc dev ./brainroc ./examples/some_program.bf just exits with status code 0 after 249 iterations :confused:
Running the program compiled using roc build in lldb shows that the segfault happens inside malloc, so it might be an out-of-memory issue maybe? :thinking:
Output from lldb:
For bottles.bt:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603ff0)
frame #0: 0x000000010000d6f4 brainroc`decrement_refcounted_ptr_8
brainroc`decrement_refcounted_ptr_8:
-> 0x10000d6f4 <+0>: stp x29, x30, [sp, #-0x10]!
0x10000d6f8 <+4>: ldr x8, [x0]
0x10000d6fc <+8>: cbz x8, 0x10000d71c ; <+40>
0x10000d700 <+12>: mov x10, #-0x8000000000000000
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603ff0)
* frame #0: 0x000000010000d6f4 brainroc`decrement_refcounted_ptr_8
frame #1: 0x000000010000d8a0 brainroc`___lldb_unnamed_symbol333506 + 40
frame #2: 0x000000010000a638 brainroc`List_get_53eef38977ca9e3af29e8b6fc9f50f557be9bbd173abd2118eb5488f19fb2 + 100
frame #3: 0x00000001000095b4 brainroc`List_update_82d12b6b12b8099eecd17c66fbb6ef41a8137f8c1ef74478e7bee978be8f9 + 64
frame #4: 0x000000010000a6e8 brainroc`#UserApp_dec_ca98df76e744faeef21fb76918295997c9c1b552a2e623d6fc162a11de8fae + 84
frame #5: 0x0000000100007144 brainroc`#UserApp_runOne_5250a04b6d2518751b896097c7387fb10ff97b95f3e08a2795e163ade21ee + 240
frame #6: 0x0000000100009f48 brainroc`#UserApp_run_618af7379dc962a8e685497132e67d82737289278bf8d7fadc9a49f2d03ce7 + 44
frame #7: 0x0000000100003744 brainroc`#UserApp_36_789661f33c6ea1791479ecf1f52dd93e21b779364a5197d9de3459113903b9c + 68
frame #8: 0x0000000100004380 brainroc`Task_39_5cbbff1635f59ae21a02af6cfe0157283a05fb77d9b6ef4377a9133a78ffbe5 + 164
frame #9: 0x00000001000043d8 brainroc`Effect_effect_after_inner_935229af12e1c6ee752a73f7b73add5a7c7a22cfba9e577e778e240ed627a + 72
frame #10: 0x000000010000b944 brainroc`Effect_effect_after_inner_1b6b9e2f2c8025d6941d1d79426973c1ba899598ef8eecc9bea3f5f3657b4477 + 84
frame #11: 0x000000010000e2e8 brainroc`roc__mainForHost_0_caller + 12
frame #12: 0x0000000100034d60 brainroc`host::call_the_closure::h2bc8f279b028add5 + 160
frame #13: 0x0000000100034c9c brainroc`rust_main + 132
frame #14: 0x000000010000f4d4 brainroc`main + 20
frame #15: 0x00000001a79fff28 dyld`start + 2236
For sierpinski.bt, bench1.bt, bench2.bt:
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603ff0)
frame #0: 0x00000001a7b946b8 libsystem_malloc.dylib`nanov2_malloc_zero_on_alloc + 4
libsystem_malloc.dylib`nanov2_malloc_zero_on_alloc:
-> 0x1a7b946b8 <+4>: stp x20, x19, [sp, #-0x20]!
0x1a7b946bc <+8>: stp x29, x30, [sp, #0x10]
0x1a7b946c0 <+12>: add x29, sp, #0x10
0x1a7b946c4 <+16>: add x8, x1, #0xf
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603ff0)
* frame #0: 0x00000001a7b946b8 libsystem_malloc.dylib`nanov2_malloc_zero_on_alloc + 4
frame #1: 0x0000000100034134 brainroc`roc_alloc + 24
frame #2: 0x0000000100009c40 brainroc`List_replaceUnsafe_8256d790c99390129cd6628d4d43bc44f55cfb83af722d8248666b192be24d65 + 176
frame #3: 0x0000000100003584 brainroc`List_update_9aea80b84c2f1cf95fb346d143d7e8d50c7bf2b25952d5f8f5db3c06a669b1a + 104
frame #4: 0x0000000100009a84 brainroc`#UserApp_inc_ab6bf482529beb271842e34a3922217532b2f57565d72aea510e9edbdcef6 + 84
frame #5: 0x0000000100007158 brainroc`#UserApp_runOne_5250a04b6d2518751b896097c7387fb10ff97b95f3e08a2795e163ade21ee + 260
frame #6: 0x0000000100009f48 brainroc`#UserApp_run_618af7379dc962a8e685497132e67d82737289278bf8d7fadc9a49f2d03ce7 + 44
frame #7: 0x0000000100003744 brainroc`#UserApp_36_789661f33c6ea1791479ecf1f52dd93e21b779364a5197d9de3459113903b9c + 68
frame #8: 0x0000000100004380 brainroc`Task_39_5cbbff1635f59ae21a02af6cfe0157283a05fb77d9b6ef4377a9133a78ffbe5 + 164
frame #9: 0x00000001000043d8 brainroc`Effect_effect_after_inner_935229af12e1c6ee752a73f7b73add5a7c7a22cfba9e577e778e240ed627a + 72
frame #10: 0x000000010000b944 brainroc`Effect_effect_after_inner_1b6b9e2f2c8025d6941d1d79426973c1ba899598ef8eecc9bea3f5f3657b4477 + 84
frame #11: 0x000000010000e2e8 brainroc`roc__mainForHost_0_caller + 12
frame #12: 0x0000000100034d60 brainroc`host::call_the_closure::h2bc8f279b028add5 + 160
frame #13: 0x0000000100034c9c brainroc`rust_main + 132
frame #14: 0x000000010000f4d4 brainroc`main + 20
frame #15: 0x00000001a79fff28 dyld`start + 2236
(helloworld.bt and print0to999.bt run fine).
in the platform, you can make roc_alloc print e.g. how many bytes it wants to allocate. That might be helpful
is the problem here the dbg, btw, or does this always happen?
also something I've found often helpful is to put lldb backtraces and/or .ll bytecode into gpt4 and say what symptom I'm seeing and ask if it sees anything - I've had good success with that in the past!
Folkert de Vries said:
is the problem here the dbg, btw, or does this always happen?
This is the output you get after running a program compiled using a normal roc build, which IIRC strips all dbg calls.
(To be precise, I run cargo run -- build --debug ../bf/main.roc; the roc directory containing the roc repo and the bf directory containing the bf interpreter project are sibling directories.)
Folkert de Vries said:
in the platform, you can make roc_alloc print e.g. how many bytes it wants to allocate. That might be helpful
It looks like this:
# ... many more similar lines here
Allocating 1008...
Ptr: 0x102904440
Deallocating 0x102904830
Allocating 1008...
Ptr: 0x102904830
Deallocating 0x102904440
Allocating 1008...
Ptr: 0x102904440
Deallocating 0x102904830
Allocating 1008...
Ptr: 0x102904830
Deallocating 0x102904440
Allocating 1008...
Ptr: 0x102904440
Deallocating 0x102904830
Allocating 1008...
Ptr: 0x102904830
Deallocating 0x102904440
Allocating 1008...
Ptr: 0x102904440
Deallocating 0x102904830
Allocating 1008...
Ptr: 0x102904830
Deallocating 0x102904440
Allocating 930...
Ptr: 0x102904440
Deallocating 0x102904f70
Allocating 1008...
Ptr: 0x102904c20
Deallocating 0x102904830
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904c20
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Ptr: 0x1029047f0
Deallocating 0x102904be0
Allocating 1008...
Ptr: 0x102904be0
Deallocating 0x1029047f0
Allocating 1008...
Process 41816 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fd0)
frame #0: 0x00000001008c1f8c brainroc`_$LT$std..io..stdio..StdoutLock$u20$as$u20$std..io..Write$GT$::write_all::hd86c4a41d5107153 at stdio.rs:735 [opt]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16f603fd0)
* frame #0: 0x00000001008c1f8c brainroc`_$LT$std..io..stdio..StdoutLock$u20$as$u20$std..io..Write$GT$::write_all::hd86c4a41d5107153 at stdio.rs:735 [opt]
frame #1: 0x00000001008c4af8 brainroc`_$LT$std..io..Write..write_fmt..Adapter$LT$T$GT$$u20$as$u20$core..fmt..Write$GT$::write_str::h968df0371a4b9768 at mod.rs:1671:23 [opt]
frame #2: 0x00000001009197a0 brainroc`core::fmt::Formatter::pad_integral::hb214d8a54e5b3e5f at mod.rs:0 [opt]
frame #3: 0x000000010091bb00 brainroc`core::fmt::pointer_fmt_inner::hb0b335e6a5beaa1d [inlined] core::fmt::num::GenericRadix::fmt_int::hd198a0fe668a9d31 at num.rs:110:9 [opt]
frame #4: 0x000000010091ba98 brainroc`core::fmt::pointer_fmt_inner::hb0b335e6a5beaa1d [inlined] core::fmt::num::_$LT$impl$u20$core..fmt..LowerHex$u20$for$u20$usize$GT$::fmt::h338abe56eabf0dc0 at num.rs:155:17 [opt]
frame #5: 0x000000010091ba98 brainroc`core::fmt::pointer_fmt_inner::hb0b335e6a5beaa1d at mod.rs:2503:15 [opt]
frame #6: 0x0000000100822a88 brainroc`_$LT$$BP$const$u20$T$u20$as$u20$core..fmt..Pointer$GT$::fmt::h88e433855bb7f807(self=0x000000016f6041b8, f=0x000000016f604200) at mod.rs:2475:9
frame #7: 0x0000000100822a48 brainroc`_$LT$$BP$mut$u20$T$u20$as$u20$core..fmt..Pointer$GT$::fmt::heda03ebd385f5e29(self=0x000000016f6043f0, f=0x000000016f604200) at mod.rs:2514:9
frame #8: 0x0000000100822a0c brainroc`_$LT$$BP$mut$u20$T$u20$as$u20$core..fmt..Debug$GT$::fmt::hc2e87981d6756b50(self=0x000000016f6043f0, f=0x000000016f604200) at mod.rs:2543:9
frame #9: 0x0000000100919434 brainroc`core::fmt::write::hd33da007f7a27e39 at mod.rs:1208:17 [opt]
frame #10: 0x00000001008c1ab8 brainroc`_$LT$$RF$std..io..stdio..Stdout$u20$as$u20$std..io..Write$GT$::write_fmt::h881b8d60b2402de3 [inlined] std::io::Write::write_fmt::hc2f28f6e637fac24 at mod.rs:1682:15 [opt]
frame #11: 0x00000001008c1a90 brainroc`_$LT$$RF$std..io..stdio..Stdout$u20$as$u20$std..io..Write$GT$::write_fmt::h881b8d60b2402de3 at stdio.rs:716:9 [opt]
frame #12: 0x00000001008c38dc brainroc`std::io::stdio::_print::h8769d5b8d07b47a7 [inlined] _$LT$std..io..stdio..Stdout$u20$as$u20$std..io..Write$GT$::write_fmt::h687f60646da13c0c at stdio.rs:690:9 [opt]
frame #13: 0x00000001008c38b8 brainroc`std::io::stdio::_print::h8769d5b8d07b47a7 [inlined] std::io::stdio::print_to::h68985b56676cff12 at stdio.rs:1008:21 [opt]
frame #14: 0x00000001008c387c brainroc`std::io::stdio::_print::h8769d5b8d07b47a7 at stdio.rs:1075:5 [opt]
frame #15: 0x00000001000265d8 brainroc`roc_alloc(size=1008, _alignment=8) at lib.rs:43:5
frame #16: 0x000000010000ae14 brainroc`List_replaceUnsafe_dbbb614026929029a924a622e5a645206e5e1277bd8c25cb7b78527df1a8c + 176
frame #17: 0x000000010000a780 brainroc`List_update_b7f585fd6f82cd9d725cedb302bc1875b5e7bb83a5ac59033a7d4152c1447 + 104
frame #18: 0x000000010000b874 brainroc`#UserApp_dec_ca98df76e744faeef21fb76918295997c9c1b552a2e623d6fc162a11de8fae + 84
frame #19: 0x000000010000830c brainroc`#UserApp_runOne_5250a04b6d2518751b896097c7387fb10ff97b95f3e08a2795e163ade21ee + 240
frame #20: 0x000000010000b11c brainroc`#UserApp_run_618af7379dc962a8e685497132e67d82737289278bf8d7fadc9a49f2d03ce7 + 44
frame #21: 0x00000001000047c4 brainroc`#UserApp_36_66a53c5bb7482a7975e058b703753a62aee74dcb3ed8c2fbfc227399ea738e + 68
frame #22: 0x0000000100005400 brainroc`Task_39_f2eb2e65858ef9a081c444e7b9b2cef1ed51b5a1e38027833034b9f057aa3131 + 164
frame #23: 0x0000000100007e74 brainroc`Effect_effect_after_inner_3bfae27b50cc70419dec89ef8da341b1287d7bb7b3c4bb2481ba28b17a8ec4 + 72
frame #24: 0x000000010000cd98 brainroc`Effect_effect_after_inner_4f1fdebc72623b7987dfbf57d7b81537b885c9e4ce317a63dbf262856adf + 84
frame #25: 0x000000010000f3d4 brainroc`roc__mainForHost_0_caller + 12
frame #26: 0x0000000100027198 brainroc`host::call_the_closure::h337ad65c516ada9b(closure_data_ptr="A BF interpreter.") at lib.rs:234:5
frame #27: 0x0000000100027110 brainroc`rust_main at lib.rs:223:9
frame #28: 0x0000000100010554 brainroc`main + 20
frame #29: 0x00000001a79fff28 dyld`start + 2236
(lldb)
The allocations are all as large as dataSize (+ 8 for the refcount, I assume). So I think the program is not changing the data list in place.
However, what is strange is that it looks to me like these allocations are all short-lived. And the error we get is not an OOM error but a 'BAD ACCESS' error
Counting how many allocations are alive, by keeping track in a local static variable, shows that there are only 4 allocs alive during the main inner loop (and at the time of the segfault).
yeah, I'm not sure how alloc could cause a bad access
so I wonder if maybe the real cause is something else
the tool to find the actual issue is valgrind, which does not run on mac sadly
(how do mac developers get anything done???)
you can try using macOS's debug malloc to tell you if it's actually a bad access:
DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib