Stream: beginners

Topic: Profiling an AoC app


view this post on Zulip Ayaz Hafiz (Dec 09 2023 at 02:49):

Spoiler: this contains an answer to part 1 of day 8 of AoC 2023.

I feel like I'm missing something. Does anyone see anything obviously wrong with this code? (Forget about the parsing bit, it is immaterial in performance)

app "AoC"
    packages {
        pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.0/bkGby8jb0tmZYsy2hg1E_B2QrCgcSTxdUlHtETwm5m4.tar.br",
        parser: "https://github.com/lukewilliamboswell/roc-parser/releases/download/0.2.0/dJQSsSmorujhiPNIvJKlQoI92RFIG_JQwUfIxZsCSwE.tar.br",
    }
    imports [
        pf.Stdout,
        pf.Task.{ Task },
        parser.Core.{
            Parser,
            const,
            keep,
            skip,
            oneOf,
            many,
            sepBy1,
        },
        parser.String.{
            string,
            parseStr,
            codeunitSatisfies,
        },
        "day-8-input.txt" as input : Str,
    ]
    provides [main] to pf

main : Task {} *
main =
    Stdout.line part1
    |> Task.await \_ -> Stdout.line ""

part1 : Str
part1 =
    input
    |> Str.trim
    |> parseMap
    |> stepsAAAToZZZ
    |> Num.toStr

Direction : [Left, Right]
Node : { left : Str, right : Str }
Network : Dict Str Node
Map : { directions : List Direction, network : Network }

findNode : Network, Str -> Node
findNode = \network, node ->
    when Dict.get network node is
        Ok n -> n
        Err _ -> crash "bad input: node \(node) not found"

stepsAAAToZZZ : Map -> Nat
stepsAAAToZZZ = \{ directions, network } ->
    go = \steps, ds, d, cur ->
        if cur == "ZZZ" then
            steps
        else
            when List.get ds (Num.rem d (List.len ds)) is
                Ok Left -> go (steps + 1) directions (d + 1) (findNode network cur).left
                Ok Right -> go (steps + 1) directions (d + 1) (findNode network cur).right
                Err _ -> crash "bad input"

    go 0 directions 0 "AAA"

parseMap : Str -> Map
parseMap = \str ->
    pDirection : Parser _ Direction
    pDirection = oneOf [
        const Left |> skip (string "L"),
        const Right |> skip (string "R"),
    ]

    pDirections = many pDirection

    pLabel : Parser _ Str
    pLabel =
        const \c ->
            when Str.fromUtf8 c is
                Ok s -> s
                Err _ -> crash "bad input: non-utf8 character"
        |> keep (many (codeunitSatisfies \b -> b >= 'A' && b <= 'Z'))

    pNode : Parser _ (Str, Node)
    pNode =
        const (\node -> \left -> \right -> (node, { left, right }))
        |> keep pLabel
        |> skip (string " = (")
        |> keep pLabel
        |> skip (string ", ")
        |> keep pLabel
        |> skip (string ")")

    pNetwork : Parser _ Network
    pNetwork =
        const Dict.fromList
        |> keep (sepBy1 pNode (string "\n"))

    pMap : Parser _ Map
    pMap =
        const (\directions -> \network -> { directions, network })
        |> keep pDirections
        |> skip (string "\n\n")
        |> keep pNetwork

    when parseStr pMap str is
        Ok g -> g
        Err (ParsingFailure e) | Err (ParsingIncomplete e) -> crash "bad input: '\(e)'"

If I run this as-is, without opt or #6216, it takes ~9 seconds. stepsAAAToZZZ#go makes 22K iterations.

If I compile with opt and #6216 it's at 860ms. But this is still paltry compared to a simple Node.js version:

❯ hyperfine --warmup 2 ./day8 'node day8.js'
Benchmark 1: ./day8
  Time (mean ± σ):     867.2 ms ±  17.8 ms    [User: 856.2 ms, System: 5.4 ms]
  Range (min … max):   848.4 ms … 895.3 ms    10 runs

Benchmark 2: node day8.js
  Time (mean ± σ):      35.0 ms ±   1.5 ms    [User: 29.7 ms, System: 5.3 ms]
  Range (min … max):    31.5 ms …  38.6 ms    74 runs

Summary
  'node day8.js' ran
   24.77 ± 1.18 times faster than './day8'
const fs = require('fs');
const path = require('path');

const input = fs.readFileSync(path.resolve(__dirname, './day-8-input.txt'), 'utf8');

const dir = {left: 0, right: 1};

function parseInput(input) {
  const [directions, nodes] = input.trim().split('\n\n');
  const dirs = directions.split('').map(d => d === 'L' ? dir.left : dir.right);
  const network = {};
  const parsedNodes = nodes.split('\n').forEach(n => {
    const [name, lrs] = n.split(' = (');
    const left = lrs.substring(0, 3);
    const right = lrs.substring(3 + ', '.length, 3 + ', '.length + 3);
    network[name] = {left, right};
  });
  return {dirs, network};
}

const {dirs, network} = parseInput(input);

function stepsAAAToZZZ(dirs, network) {
  let steps = 0;
  let cur = "AAA";
  while (cur !== "ZZZ") {
    const {left, right} = network[cur];
    const dir = dirs[steps % dirs.length];
    cur = dir === 0 ? left : right;
    steps++;
  }
  return steps;
}

console.log(stepsAAAToZZZ(dirs, network));

I ran a profile and all the time is spent in Dict and doing str refcounts. But 800ms for 22K iterations of that loop feels insane to me. The input file has ~700 nodes. So in the absolute worst case, where each node must be scanned, this Roc program amortizes to 800ms/22K/700~=5ns per scan. While that might be reasonable, it seems super unlikely in practice.

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 02:59):

Dicts with refcounts keys or values are currently a pathological case

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:00):

Cause we don't have any sort of smarter refcounting or borrowing, each call to dict.get will increment the refcount and then decrement it.

This will be recursive

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:00):

So it will spend all the time traversing all elements of the dict incrementing and decrementing refcounts

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:00):

This is why I want automatic borrowing

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:04):

If the strings are small and you write a when to convert them to a static sized tuple, it would run way way faster

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:05):

Oh, also, it will chase the node and increment those two strings as well.

view this post on Zulip Ayaz Hafiz (Dec 09 2023 at 03:12):

I tried it with a assoc list too and it was just as slow

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:15):

Did you make a get method? If so it took ownership of the list, requiring a recursive increment when called and a recursive decrement in the function when the list is dropped. Has to be manually inlined to avoid this

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:16):

It's a huge perf problem for anything wrapping list.getin another function when the list contains recounted types

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 03:25):

Also, before we fix add the borrowing, probably fixing this recounting worst case it's more important. I think I'll work on that next. Shouldn't be too hard

view this post on Zulip Ryan Bates (Dec 09 2023 at 04:55):

@Ayaz Hafiz

AoC Day 8 spoiler

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 05:21):

Just to clearly show how fast this should be. Just took Ayaz's code and changed from a Str to a (U8, U8, U8):

Benchmark 1: /tmp/aoc
  Time (mean ± σ):     861.2 ms ±  24.9 ms    [User: 855.8 ms, System: 1.5 ms]
  Range (min … max):   828.2 ms … 894.9 ms    10 runs

Benchmark 2: /tmp/aoc-u8
  Time (mean ± σ):       7.9 ms ±   0.2 ms    [User: 6.4 ms, System: 0.9 ms]
  Range (min … max):     7.5 ms …   8.5 ms    270 runs

Summary
  '/tmp/aoc-u8' ran
  109.63 ± 4.14 times faster than '/tmp/aoc'

New version of the code

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 05:22):

Once we fix the recursive list refcounting, these two should run at essentially the same speed

view this post on Zulip Luke Boswell (Dec 09 2023 at 05:40):

Nice work Brendan, this is really cool so see how performance can be impacted like this, and that there are solutions.

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 05:43):

Haha, now I just need to fix the underlying issue in the compiler

view this post on Zulip Declan Joseph Maguire (Dec 09 2023 at 05:44):

I have no contributions to this little optimisation saga, I just want to say that I'm loving it. There's something hypnotic about watching something get optimised.

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 05:51):

Should be roughly:

  1. List and Str store their full size on the heap (only needs to be updated when the list refcount goes from 1 to 2).
  2. Using this, seamless slices can free all elements of an underlying list
  3. All lists to the same allocation only keep a single reference to each inserted element. As such list will inc the refcount on insert, but will never touch the refcount recursively when passing the list around.
  4. When deallocating a list or slice, this is when we have to do all the recursive deallocations.
  5. Also, if we actually clone a list, we will still have to recursively increment the recount.

Hmm, though on the heap I have to think about capacity vs size. Cause ideally we want to know the full capacity for in the future when roc_dealloc is passed the full allocation size. But for recount purposes, we want to know the full size such that the slice can just decrease the refcounts of real elements, not old junk elements. I'll have to think about this a bit more, but something along those lines should free us from this problem (though will keep some refcounted objects alive for longer)

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 07:58):

@Folkert de Vries do my comments here sound roughly right to you? Anything I am missing? I know you have thought about this recursive refcounting issue before.

view this post on Zulip Folkert de Vries (Dec 09 2023 at 13:17):

well fundamentally this is a simple change. But there are many places now where the assumption is made that an inc/dec on a list is recursive, and we have to track them all down and fix them

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 16:52):

For sure, there will be memory leaks and double frees for a while as I do this. Hopefully due to tests and valgrind I'll be able to dig through those

view this post on Zulip Ayaz Hafiz (Dec 09 2023 at 18:17):

Im not sure that refcounting is enough to explain the problem here. In a trace I ran, refcounts of the inner strings as 70ms/300ms of the profile. So I'm not sure that automatically translates to a 100x improvement

view this post on Zulip Ayaz Hafiz (Dec 09 2023 at 18:18):

I also wrote this program in Koka, and it ran in about the same time as the JS

view this post on Zulip Richard Feldman (Dec 09 2023 at 18:27):

the refcounts also mess with the cache though

view this post on Zulip Richard Feldman (Dec 09 2023 at 18:27):

I'd try running it with refcounts disabled (like they're a no-op) and see how that compares

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 18:30):

Does koka have the recursive refcount issue where if you pass a List Str to a function, it will end up recursively incrementing the refcount of all strings in the list? Or in this case Dict Str {Str, Str} so 3x as many strings to increment. Then at the end of the function it will recursively decrement the refcount of all the strings. (in this case, technically it is just traversing all the strings and realizing they are small then moving to the next one.)

You need full debug info from dwarf and perf using it to see the inlined functions. Almost all of the refcounted functions are inlined

Purple is refcounting. It is 98.6% of the program execution:
Screenshot-2023-12-09-at-10.30.06AM.png

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 18:38):

Flamegraph of my faster version with U8s Purple is still refcounting. Essentially all of the time is spent in the parser:
Screenshot-2023-12-09-at-10.37.10AM.png

~5% of the 109x faster version of the application is spent in refcounting

view this post on Zulip Brendan Hansknecht (Dec 09 2023 at 18:39):

Also, wow, the parser has a deep call stack.

view this post on Zulip Musab Nazir (May 22 2024 at 15:22):

Dicts with refcounts keys or values are currently a pathological case

Is this still a WIP issue? I ran into a problem where my roc code was running almost a 1000x slower than similar go code. Playing around with it a bit more I realized most of this overhead is coming from using Str key values for my Dict. Using a num type puts me back in go ballpark numbers

view this post on Zulip Brendan Hansknecht (May 23 2024 at 10:54):

Yeah, I have a PR that is mostly done and I need to circle back to.

view this post on Zulip Brendan Hansknecht (May 23 2024 at 10:55):

That said, may be slow to finish. I think most remaining failing tests are segfaults. (Probably also have a number of memory leaks)


Last updated: Jul 06 2025 at 12:14 UTC