Stream: beginners

Topic: Segfault with long running program


view this post on Zulip Nick Hallstrom (Feb 27 2023 at 10:25):

I have been running the Roc Clock as an actual clock these last few days. The problem is that it consistently seg faults after running for about 8 hours. I’m guessing this is going to be a hard issue to debug 😅
Seg Faults

view this post on Zulip Nick Hallstrom (Feb 27 2023 at 10:26):

Here’s the code for it if that’s useful
https://github.com/Billzabob/roc-clock/blob/main/examples/clock.roc

view this post on Zulip Anton (Feb 27 2023 at 10:46):

Valgrind might be able to shed more light on the cause, can you execute:

valgrind --leak-check=full --track-origins=yes ./examples/clock

view this post on Zulip Anton (Feb 27 2023 at 17:58):

Letting the clock run a lot faster could also trigger the segfault much sooner...

view this post on Zulip Nick Hallstrom (Feb 27 2023 at 18:13):

Ya that's a good point. I'm not home right now though, so I'll just let it run it's course today. But if I need to redo this I'll just disconnect the PI from the clock hardware and just run it without all the sleeps.

view this post on Zulip Nick Hallstrom (Feb 28 2023 at 04:04):

Is this useful at all?
Valgrind

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:30):

haha 12 GB allocated over the course of the application.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:32):

Looks to be str.graphemes that is leaking memory, so that is useful to know.

view this post on Zulip Nick Hallstrom (Feb 28 2023 at 04:33):

Guess I just need to download some more RAM then

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:36):

The question is why isn't the memory being freed. Could just be a bug in str.graphemes or could be some other bug in the application that is keeping around the data generated from str.graphemes. I'll try to take a look and see what's going on.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:39):

I have a fuzz test for Str.graphemes (that has never run on arm), but at least on x86_64, it seems to pass consistently. Not definitive of anything, but suggests that the bug if probably not just directly in Str.graphemes.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:44):

Looking at the code, I bet the bug is in Task.loop, or more specifically Effect.loop. I bet that it is not correctly decrementing the refcount of the looped over data between iterations.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 04:44):

I am gonna try to make a small repro and see.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 05:05):

Hmmm, unless this issue is arm specific, my guess seems wrong. Somehow numbers is being shared around in a way such that roc never manages to decrement it's refcount to zero, so it is never freed. Maybe another function is missing a decrement.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 05:09):

As a simple validity check, can you do two tests for me on you pi:

  1. numbers = Str.graphemes n -> numbers = lastNumbers. I am pretty sure that this shouldn't crash anytime soon/shouldn't accumulate a ton of memory.

  2. numbers = Str.graphemes n -> numbers = [ "8", "8", "8", "8" ]. If I am correct, this should still lead to the memory leak and then crash.

view this post on Zulip Nick Hallstrom (Feb 28 2023 at 20:11):

It seems like your hypothesis is correct! #2 crashed after a short while and #1 has been running for a while without any issues

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 20:48):

Ok. Good to know

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 21:15):

Next i guess i should look at the ir with recounting and see anything missing, otherwise, need to dig into other functions that numbers is passed to. Some built-in likely is simply not correctly handling refcounts.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 22:09):

Ok. I think I found the issue. Looks like List.map2 and List.map3 have memory leak.

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 22:12):

minimal repro: #5079

view this post on Zulip Nick Hallstrom (Feb 28 2023 at 22:29):

So I think I was wrong, looks like 1 and 2 above both eventually segfaulted. Does that make sense for the List.mapN thing?

view this post on Zulip Brendan Hansknecht (Feb 28 2023 at 22:31):

I think there may be more than one memory leak, just at different speeds. does 1 take longer to segfault?

view this post on Zulip Nick Hallstrom (Feb 28 2023 at 23:36):

Yes seemed like the first one took much longer to seg fault

view this post on Zulip Nick Hallstrom (Mar 01 2023 at 04:42):

Trying to familiarize myself with the codebase. Am I looking in the right area here? I'm guessing Dec and IncN refer to the reference count?

view this post on Zulip Brendan Hansknecht (Mar 01 2023 at 06:09):

Yep and yep

view this post on Zulip Brendan Hansknecht (Mar 01 2023 at 06:10):

In this case my guess is something weird is happening with the 2nd to nth list that cause roc to not decrement their refcount after the function. We have a whole system around wether variables are passed to built-ins as owned or borrowed. I am guessing that is the root cause of the bug.

view this post on Zulip Brendan Hansknecht (Mar 01 2023 at 06:11):

Though i don't really understand the fundamentals of how that system works/the underlying details.

view this post on Zulip Brendan Hansknecht (Mar 01 2023 at 06:12):

Could also be a bug local to those functions directly, but at first glance (and a little messing around), i didn't think so.

view this post on Zulip Brendan Hansknecht (Mar 01 2023 at 06:16):

Ownership definition defined here: https://github.com/roc-lang/roc/blob/main/crates/compiler/mono/src/borrow.rs#L1009

view this post on Zulip Richard Feldman (Mar 01 2023 at 12:20):

is there a way we could rewrite those functions to be in pure Roc?

view this post on Zulip Richard Feldman (Mar 01 2023 at 12:21):

(using unsafe low-level List-module-only primitives)

view this post on Zulip Folkert de Vries (Mar 01 2023 at 12:36):

I'd like that to happen, but currently it would have worse performance

view this post on Zulip Folkert de Vries (Mar 01 2023 at 12:36):

maybe it's worth it for these List.mapN functions though

view this post on Zulip Folkert de Vries (Mar 01 2023 at 12:37):

for List.map itself I remember the performance is significantly worse. But these days my benchmarking harness segfauls and I cannot figure out why that is

view this post on Zulip Richard Feldman (Mar 01 2023 at 14:05):

eh let's not degrade perf then

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 05:55):

I think I found the bug. Will push a pr soon.

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 05:58):

essentially we were only decrementing the refcount of the last list in List.mapN functions.

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 06:00):

Entire change is to delete 2 dollar signs in a macro: https://github.com/roc-lang/roc/pull/5085/files

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 06:09):

@Nick Hallstrom when you get the chance, can you run your clock with this change and valgrind again. I assume there is more than 1 memory leak and want to try and track down what else is leaking memory. Another screenshot like you sent above would be useful in debugging this.

view this post on Zulip Nick Hallstrom (Mar 02 2023 at 20:17):

Got the next one for you
Seg Fault 2

view this post on Zulip Nick Hallstrom (Mar 02 2023 at 20:17):

So still another leak with Str.graphemes it looks like?

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 22:17):

Just wondering, is this a --optimized build?

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 22:22):

Also, Str.graphemes is not leaking anymore. It has one block of size 104 bytes. Which is the exact size of a list with 4 small strings in it. It's just that the program crashed, so valgrind noticed the memory is still in use when the program ended.

Hard to say about the other data. I mean, the 32 bytes in 4 blocks that was definitely lost is obviously a leak. I would guess the 1564 bytes in 46 blocks is also leaked, but would need to look at your program for other expected allocations to be sure.

view this post on Zulip Brendan Hansknecht (Mar 02 2023 at 22:24):

I think the stack overflow is a separate bug/issue. Maybe a problem with tail recursion or need for the --optimized flag. Given only, 10KB are in use at exit, that shouldn't be enough pressure to end up limiting the stack size.

view this post on Zulip Nick Hallstrom (Mar 02 2023 at 23:54):

Ya, if I didn’t have the —optimize flag, it would seg fault during compilation on the Pi. You were the one that found that workaround I believe.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 00:11):

Ah, I guess I already forgot the exact contraints

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 00:12):

I'll take another look and see if I have any ideas around the stack overflow

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 01:00):

Ok. I think I found the issue. We have a bug in our generation of Effect.loop. It is not generating a tail recursive function. As such, after some giant number of runs, we get a stack overflow

Not sure if this is specific to your example for some reason or a general issue. Trying to make a minimal repro now.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 01:11):

Yeah, so far having a hard time generating a repro. I assume that somehow it is related to inlining and llvm.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 01:12):

I think you have a pretty long chain of tasks with Task.traverse on multiple lists

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 01:46):

Yeah, the generated llvm ir from clock.roc for Effect.loop does not have tail call optimization

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 01:46):

No idea why not so far.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 02:03):

#5086 is a minimal repro. Not really sure what is going on though. Some sort of bad reaction of Task.loop being after other Tasks.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 02:04):

@Nick Hallstrom if you can make your program work without init and startup, that will workaround the issue.

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 02:11):

Something like this:

main =
    Task.loop {initialized: Bool.false, lastNumbers: ["8", "8", "8", "8"]} run

...

run = \{initialized, lastNumbers} ->
    if !initialized then
        _ <- await init
        _ <- await startup
        Task.succeed (Step {initialized: Bool.true, lastNumbers})
    else
        _ <- await (Task.sleep 1000)
        n <- await Time.getTime
        numbers = Str.graphemes n
        _ <- await (setClock lastNumbers numbers)
        Task.succeed (Step {initialized, lastNumbers: numbers})

view this post on Zulip Brendan Hansknecht (Mar 03 2023 at 02:14):

Looks to fix the other memory leak as well

view this post on Zulip Nick Hallstrom (Mar 03 2023 at 03:25):

Oh awesome! I’ll give that a try in the morning. Thanks again Brendan

view this post on Zulip Nick Hallstrom (Mar 04 2023 at 00:59):

Just an update on this. It did indeed fix the issue. It’s been running for a day with all the sleeps taken out and is still going strong 💪

view this post on Zulip Richard Feldman (Mar 04 2023 at 01:28):

yoooo, awesome!

view this post on Zulip Richard Feldman (Mar 04 2023 at 01:29):

thanks for helping valgrind these bugs to the surface - now they can be fixed for everyone! :smiley:

view this post on Zulip Ayaz Hafiz (Mar 06 2023 at 04:24):

@Nick Hallstrom I believe https://github.com/roc-lang/roc/pull/5095 will solve the stack overflow you ran into above, without needing to workaround

view this post on Zulip Nick Hallstrom (Mar 06 2023 at 04:40):

Woohoo :tada: I’ll give it a try in the morning. Thanks so much 🙏

view this post on Zulip Nick Hallstrom (Mar 08 2023 at 05:31):

Got to this a bit late, but your fix seems to be working! Thanks a bunch Ayaz!


Last updated: Jul 05 2025 at 12:14 UTC