Stream: beginners

Topic: Learning questions with 1brc: no args with legacy linker?


view this post on Zulip Paul Joubert (Jun 14 2024 at 20:15):

Hi! I'm new here, and I'm learning Roc by rewriting some coding puzzles I've done previously in Haskell in Roc. I'm currently doing the One Billion Row Challenge (which is basically just parsing a list of cities with temperatures and returning the min/mean/max temperature for each city).

I've gotten thusfar, working from the CLI args example:

app [main] {
    pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.10.0/vNe6s9hWzoTZtFmNkvEICPErI9ptji_ySjicO6CkucY.tar.br",
}

import pf.Stdout
import pf.File
import pf.Path
import pf.Task
import pf.Arg

main =
    finalTask =
        pathArg = readFirstArgT!
        readFileToStr (Path.fromStr pathArg)

    finalResult <- Task.attempt finalTask

    when finalResult is
        Err ZeroArgsGiven ->
            Task.err (Exit 1 "Error ZeroArgsGiven")

        Err (ReadFileErr errMsg) ->
            Task.err (Exit 1 "Error ReadFileErr:\n$(errMsg)")

        Ok fileContentStr ->
            fileContentStr |> parse |> Stdout.line


readFirstArgT =
    args = Arg.list!
    dbg args
    List.get args 1 |> Result.mapErr (\_ -> ZeroArgsGiven) |> Task.fromResult


readFileToStr = \path ->
    path
    |> File.readUtf8
    |> Task.mapErr \_ -> ReadFileErr "Failed to read file: $(Path.display path)"


parse = \contents ->
    citiesList = Str.split contents "\n"

    results = List.walk citiesList (Dict.empty {}) \state, elem ->
        cityWithTemp = Str.split elem ";"
        city =
            when List.get cityWithTemp 0 is
                Ok v -> v
                Err _ -> crash "err"

        temp =
            when List.get cityWithTemp 1 is
                Ok v -> when Str.toF64 v is
                            Ok w -> w
                            Err _ -> crash "err"
                Err _ -> crash "err"

        Dict.update state city \value ->
            when value is
                Missing ->
                    Present { min: temp, mean: temp, max: temp }
                Present { min: oldMin, mean: oldMean, max: oldMax } ->
                    Present
                        {
                            min: Num.min temp oldMin,
                            mean: (temp + oldMean) / 2,
                            max: Num.max temp oldMax,
                        }

    results
    |> Dict.toList
    |> List.map \(city, {min, mean, max}) ->
        "$(city): $(Num.toStr min), $(Num.toStr mean), $(Num.toStr max)"
    |> Str.joinWith "\n"

roc dev tells me "The surgical linker currently has issue #3609 and would fail linking your app. Please use --linker=legacy to avoid the issue for now." I'm not sure which change made it so the surgical linker wouldn't work anymore, but I think it was inbetween when I realised I can't Stdout.line a Dict and have to manually create the string to print. When I run it with the legacy linker (roc dev --linker=legacy -- measurements.txt), it always results in ZeroArgsGiven, and dbg args gives [ ]. If I remember correctly, this wasn't so before I introduced whatever caused the problem with the linker. Could this be, or is it a coincidence? If so, what mistake am I making in my code?

I also wanted to know if there is a way to print, e.g., dicts. I'm used to elements having show in Haskell, so I can do print on basically any structure. Having to do that last results |> Dict.toList |> List.map ... |> Str.joinWith "\n" bit is a bit tedious tbh.

Then, I'm still working on my understanding of error handling, but I'm pretty sure this is sub-optimal:

when List.get cityWithTemp 1 is
                Ok v -> when Str.toF64 v is
                            Ok w -> w
                            Err _ -> crash "err"
                Err _ -> crash "err"

Is there something simple I'm missing? Or something more fundamental?

Thanks!

view this post on Zulip Brendan Hansknecht (Jun 14 2024 at 21:09):

Inspect.toStr should help you

view this post on Zulip Brendan Hansknecht (Jun 14 2024 at 21:09):

For printing a dict

view this post on Zulip Paul Joubert (Jun 14 2024 at 21:11):

Oh yes! I forgot about Inspect. Thanks.

view this post on Zulip Anton (Jun 15 2024 at 12:59):

I'll do a new basic-cli release soon that should improve the args parsing.
If you want you can use this to simplify the file reading a lot.

view this post on Zulip Paul Joubert (Jun 15 2024 at 13:30):

I'm already "watching" the basic-cli repo :) looking forward to it <3
About the import file syntax, does that compile the file into the binary? Because the one billion row file is 14GB haha. Also, I think I'd prefer to keep the file dynamic, so I can benchmark performance on different sizes without recompiling.

view this post on Zulip Paul Joubert (Jun 15 2024 at 13:40):

I've rerun the CLI args example as is, and it doesn't see the args when using the legacy linker, so it seems the problem is with using the legacy linker. Legacy linker means using shared system libraries and such, right? In which case maybe the problem is because I'm running NixOS. Does anyone know what in my code above prevents the surgical linker from working? I didn't see anything on the issue that seemed relevant.

view this post on Zulip Anton (Jun 15 2024 at 13:40):

About the import file syntax, does that compile the file into the binary?

Yes indeed.

Also, I think I'd prefer to keep the file dynamic, so I can benchmark performance on different sizes without recompiling.

Makes sense :)

view this post on Zulip Anton (Jun 15 2024 at 13:41):

In which case maybe the problem is because I'm running NixOS.

I'm on NixOS as well, I'll see if it is any different on my ubuntu laptop.

view this post on Zulip Anton (Jun 15 2024 at 13:47):

Now that I think of it; for the benchmark later, make sure to use:

$ roc build yourApp.roc --optimize

And then time the execution of ./yourApp cities.csv

view this post on Zulip Paul Joubert (Jun 15 2024 at 13:49):

Oh yeah, I've been intending to do that but haven't gotten it to build yet in the first place haha :smiling: but thanks tho

view this post on Zulip Paul Joubert (Jun 15 2024 at 13:50):

I've been following the meta conversation about Roc development, all Richard's talks and podcasts and such, so I know a lot about the way things work, I just haven't actually sat down and written code until now hahaha

view this post on Zulip Anton (Jun 15 2024 at 14:00):

Yeah the behavior is the same on Ubuntu, the tracking issue for this is here.

view this post on Zulip Paul Joubert (Jun 15 2024 at 14:00):

Oh huh thanks for confirming

view this post on Zulip Anton (Jun 15 2024 at 14:15):

The Str.toF64 is what requires the legacy linker, Str.toU64 does work (if I change cities.csv to natural numbers).

view this post on Zulip Paul Joubert (Jun 15 2024 at 14:20):

Hm I tried Str.toDec and it segfaulted :skull:

view this post on Zulip Anton (Jun 15 2024 at 14:20):

Yeah, I tried that too :smiling_face_with_tear:

view this post on Zulip Anton (Jun 15 2024 at 14:21):

I'll see if we have an issue for that

view this post on Zulip Anton (Jun 15 2024 at 14:29):

No issue, it doesn't happen every time when using Str.toDec (in other code), I'll minimize and make an issue.

view this post on Zulip Musab Nazir (Jun 15 2024 at 16:31):

I was doing the same challenge (1brc) few weeks ago and ran into what I understand is a refcount performance issue. And I think we also have an open issue to load a large file in chunks which is probably required for the full 1 billion row file
https://roc.zulipchat.com/#narrow/stream/231634-beginners/topic/Profiling.20an.20AoC.20app/near/440118892

These were some of my numbers for a super naive implementation that stopped after parsing and creating the dict

1M row file
roc reference: 103271ms
gleam reference: 2251ms
gleam (arrays): 2901ms
go reference: 140ms

view this post on Zulip Anton (Jun 15 2024 at 16:47):

btw this is the issue for reading in chunks: github.com/roc-lang/basic-cli/issues/205

view this post on Zulip Anton (Jun 15 2024 at 16:59):

I'll minimize and make an issue.

#6813

view this post on Zulip Anton (Jun 15 2024 at 18:08):

I've investigated the empty args issue further and also found a temporary workaround :)

view this post on Zulip Luke Boswell (Jun 15 2024 at 22:34):

Oh I implented that
https://github.com/roc-lang/basic-cli/pull/206

Just forgot to add a test script and make it ready for review

view this post on Zulip Luke Boswell (Jun 15 2024 at 22:35):

I forgot I had that sitting there

view this post on Zulip Paul Joubert (Jun 16 2024 at 15:10):

Ok for completion's sake, I managed to get things working!

Code and results:

app [main] {
    pf: platform "./basic-cli/platform/main.roc",
}

import pf.Stdout
import pf.File
import pf.Task
import pf.Arg

main =
    finalTask =
        pathArg = readFirstArgT!
        readFileToStr pathArg

    finalResult <- finalTask |> Task.attempt

    when finalResult is
        Err ZeroArgsGiven ->
            Task.err (Exit 1 "Error ZeroArgsGiven")
        Err (ReadFileErr errMsg) ->
            Task.err (Exit 1 "Error ReadFileErr:\n$(errMsg)")
        Ok fileContentStr ->
            fileContentStr |> parse |> Inspect.toStr |> Stdout.line!
        _ ->
            Task.err (Exit 1 "Unknown error")

readFirstArgT =
    args = Arg.list!
    List.get args 1 |> Result.mapErr (\_ -> ZeroArgsGiven) |> Task.fromResult

readFileToStr = \path ->
    path
    |> File.readUtf8
    |> Task.mapErr \_ -> ReadFileErr "Failed to read file: $(path)"

parse = \contents ->
    citiesList = Str.split contents "\n"

    List.walk citiesList (Dict.empty {}) \state, elem ->
        cityWithTemp = Str.split elem ";"
        [city, t] = cityWithTemp
        temp = Result.withDefault (Str.toF32 t) 0

        Dict.update state city \value ->
            when value is
                Missing -> Present { min: temp, mean: temp, max: temp }
                Present { min: oldMin, mean: oldMean, max: oldMax } ->
                    Present {
                        min: Num.min temp oldMin,
                        mean: (temp + oldMean) / 2,
                        max: Num.max temp oldMax,
                    }

hyperfine --warmup 1 -L num 1_000,10_000,100_000,1_000_000 "./main measurements_{num}.txt"

Benchmark 1: ./main ../1brc-hs/measurements_1_000.txt
  Time (mean ± σ):       6.9 ms ±   0.3 ms    [User: 5.8 ms, System: 1.1 ms]
  Range (min … max):     6.2 ms …   8.4 ms    261 runs

Benchmark 2: ./main ../1brc-hs/measurements_10_000.txt
  Time (mean ± σ):      67.6 ms ±   1.2 ms    [User: 65.2 ms, System: 2.0 ms]
  Range (min … max):    65.4 ms …  71.3 ms    42 runs

Benchmark 3: ./main ../1brc-hs/measurements_100_000.txt
  Time (mean ± σ):     661.0 ms ±   5.8 ms    [User: 654.7 ms, System: 3.9 ms]
  Range (min … max):   653.8 ms … 671.7 ms    10 runs

Benchmark 4: ./main ../1brc-hs/measurements_1_000_000.txt
  Time (mean ± σ):      6.557 s ±  0.041 s    [User: 6.510 s, System: 0.025 s]
  Range (min … max):    6.501 s …  6.621 s    10 runs

Which is faster than I got my Haskell code, in a few cases at least! Which was:

{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -Wno-incomplete-uni-patterns #-}

module Main where

import System.Environment (getArgs)
import Data.Foldable (foldl')
import qualified Data.Map.Strict as M
import qualified Data.Text as T

type Results = M.Map T.Text (Float, Float, Float)

main :: IO ()
main = do
  [path] <- getArgs
  txt <- readFile path
  print $ M.toList . foldl' addCity M.empty . T.lines . T.pack $ txt

addCity :: Results -> T.Text -> Results
addCity results line =
  let (city, temp) = parseLine line
  in M.insertWith f city (temp, temp, temp) results
      where f (temp, _, _) (oldMin, oldMean, oldMax) =
              (min temp oldMin, (temp + oldMean) / 2, max temp oldMax)

parseLine :: T.Text -> (T.Text, Float)
parseLine line =
  (x, read (T.unpack y) :: Float)
    where [x, y] = T.splitOn ";" line

Which resulted in:

|       rows | runs | result (ms) |       ± |
|-----------:|-----:|------------:|--------:|
|      1 000 |  164 |        13.4 |     0.9 |
|     10 000 |   50 |        52.9 |     3.8 |
|    100 000 |   10 |       386.2 |    10.7 |
|  1 000 000 |   10 |     4 107.0 |    28.0 |
| 10 000 000 |   10 |    47 138.0 | 1 736.0 |

For some reason, the Haskell is faster from 100 000 rows. Since Roc starts out faster, maybe this is a file reading thing? Not sure. (I did compile with --optimize).

I do like the Haskell for being more concise, but then again it doesn't do any error handling.

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 17:01):

Just a note, we have a bug that effects dict containing strings and would be killing your perf

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 17:58):

Also, I'm really surprised that this works. It is probably a bug: [city, t] = cityWithTemp. Cause there is no guarantee the list is length 2.

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 18:49):

So, I wrote a terrible function to instead of using strings, use tuples of 32 U8s. The longest string was 26 bytes, so everything fits in these tuples, but they are way less efficient for hashing than a string. This should actually be a slower solution, but it work around a current refcounting bug.

Your code on my machine (so using Dict Str):

|       rows | runs | result (ms) |       ± |
|-----------:|-----:|------------:|--------:|
|      1 000 |  294 |         6.8 |     0.4 |
|     10 000 |   55 |        54.9 |     6.5 |
|    100 000 |   10 |       531.8 |    72.8 |
|  1 000 000 |   10 |     5 434.0 |   737.0 |
| 10 000 000 |   10 |    51 639.0 | 6 809.0 |

My slightly modified code to use tuples (so using Dict (U8, ..., U8)):

|       rows | runs | result (ms) |       ± |
|-----------:|-----:|------------:|--------:|
|      1 000 |  248 |         4.1 |     0.4 |
|     10 000 |  300 |         6.3 |     0.2 |
|    100 000 |   96 |        27.6 |     2.7 |
|  1 000 000 |   12 |       233.2 |     1.1 |
| 10 000 000 |   10 |     2 291.0 |    15.0 |

The Dict Str version should actually be faster than this second version with the tuples, but we have a refcounting bug to finish fixing. So the roc version should be about 20x faster for large inputs.

view this post on Zulip Paul Joubert (Jun 16 2024 at 20:15):

Brendan Hansknecht said:

I'm really surprised that this works. It is probably a bug: [city, t] = cityWithTemp. Cause there is no guarantee the list is length 2.

Oh huh, I think I wrote that with the intention of afterwards chasing down the type errors to do error handling, and then when the checker didn't complain I just forgot about it!

view this post on Zulip Paul Joubert (Jun 16 2024 at 20:17):

I wrote a terrible function

thank you, this is beautiful

view this post on Zulip Paul Joubert (Jun 16 2024 at 20:26):

It's so exciting to write functional/high-level code — and a pretty naive implementation also — that is just really fast!!

view this post on Zulip Paul Joubert (Jun 16 2024 at 20:54):

I tried to investigate the destructuring thing a bit, and tried this input:

Seoul;8.7
Minneapolis;3.5
El Paso;22.0
Helsinki;9.1
Foo
Reggane;15.4
Gaborone;25.3
Benghazi;31.6
Bridgetown;38.0
Algiers;26.8
Chihuahua;12.9

which resulted in

{
  "Seoul": {max: 8.699999809265137, mean: 8.699999809265137, min: 8.699999809265137 },
  "Minneapolis": {max: 3.5, mean: 3.5, min: 3.5 },
  "El Paso": {max: 22, mean: 22, min: 22 },
  "Helsinki": {max: 9.100000381469727, mean: 9.100000381469727, min: 9.100000381469727 },
  "Foo": {max: 9.100000381469727, mean: 9.100000381469727, min: 9.100000381469727 },
  "Reggane": {max: 15.399999618530273, mean: 15.399999618530273, min: 15.399999618530273 },
  "Gaborone": {max: 25.299999237060547, mean: 25.299999237060547, min: 25.299999237060547 },
  "Benghazi": {max: 31.600000381469727, mean: 31.600000381469727, min: 31.600000381469727 },
  "Bridgetown": {max: 38, mean: 38, min: 38 },
  "Algiers": {max: 26.799999237060547, mean: 26.799999237060547, min: 26.799999237060547 },
  "Chihuahua": {max: 12.899999618530273, mean: 12.899999618530273, min: 12.899999618530273 },
  "": {max: 12.899999618530273, mean: 12.899999618530273, min: 12.899999618530273 }
}

So, it carried over the previous value for t during the walk.

I also noticed that it added a blank string at the bottom, which is because splitting the input on "\n" results in a blank string at the end (or rather, the file string ends on a \n):

» Str.split "Seoul;8.7\nMinneapolis;3.5\n" "\n"
["Seoul;8.7", "Minneapolis;3.5", ""] : List Str

Anyway, when I insert dbg cityWithTemp above [city, t] = cityWithTemp, compilation fails with

This looks like an operator, but it's not one I recognize!

45│          dbg cityWithTemp
46│          [city, t] = cityWithTemp
                       ^

I have no specific suggestion for this operator, see TODO for the full
list of operators in Roc.

I remember trying [city, t] = Str.split elem ";", which, trying again, results in the same "looks like an operator" error. I originally thought this was maybe because split can fail, but I realise now it can't, so it must be something else.

Adding dbg after the line works. This:

 ...
    List.walk citiesList (Dict.empty {}) \state, elem ->
        cityWithTemp = Str.split elem ";"
        [city, t] = cityWithTemp
        dbg cityWithTemp
        dbg t
        temp = Result.withDefault (Str.toF32 t) 0
        dbg temp

        Dict.update state city \value -> ...

gives

...
[main.roc:46] cityWithTemp = ["Helsinki", "9.1"]
[main.roc:47] t = "9.1"
[main.roc:49] temp = 9.100000381469727

[main.roc:46] cityWithTemp = ["Foo"]
[main.roc:47] t = "9.1"
[main.roc:49] temp = 9.100000381469727

[main.roc:46] cityWithTemp = ["Reggane", "15.4"]
[main.roc:47] t = "15.4"
[main.roc:49] temp = 15.399999618530273
...

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 21:24):

Can you try Str.splitFirst there instead of split?

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 21:25):

I think it should be faster and give you a record that can be matched inline like that. As I mentioned earlier, I don't think[city, t] = ... is actually expected to work in general. So it may confuse the compiler.

view this post on Zulip Brendan Hansknecht (Jun 16 2024 at 21:29):

Also, the value carry over is definitely a bug. If you could file an issue on GitHub, that would be great.

view this post on Zulip Paul Joubert (Jun 17 2024 at 13:40):

Brendan Hansknecht said:

Can you try Str.splitFirst there instead of split?

I tried this (not sure if the nested whens is best way to express this :thinking: )

parse = \contents ->
    citiesList = contents |> Str.split "\n"

    List.walk citiesList (Dict.empty {}) \state, elem ->
        when Str.splitFirst elem ";" is
            Ok {before: city, after} ->
                temp = after |> Str.toF32 |> Result.withDefault 0

                Dict.update state city \value ->
                    when value is
                        Missing -> Present { min: temp, mean: temp, max: temp }
                        Present { min: oldMin, mean: oldMean, max: oldMax } ->
                            Present {
                                min: Num.min temp oldMin,
                                mean: (temp + oldMean) / 2,
                                max: Num.max temp oldMax,
                            }

            Err NotFound -> state

and it's actually a tiny bit slower. But by like 1ms in the 1000 line case and 200ms in the 1m line case, so could be insignificant.

I don't think[city, t] = ... is actually expected to work in general

oohhh I misunderstood earlier, sorry.

If you could file an issue on GitHub, that would be great.

Will do! :salute:

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 15:21):

and it's actually a tiny bit slower

This intuitively makes some sense to me. I'm pretty sure that [city, t] = ... has no error handling. So it runs faster due to having a bug that is hit if the list isn't exactly (at least?) 2 elements.

Also, does it help if you change the error case to:
Err NotFound -> crash "impossible"

Sometimes crashing leads to better optimizations by the compiler

view this post on Zulip Paul Joubert (Jun 17 2024 at 15:27):

Issue #6816: Destructuring like [a, b] = ["foo"] should fail, but value for b is held over from previous walk element

view this post on Zulip Paul Joubert (Jun 17 2024 at 15:49):

Brendan Hansknecht said:

Also, does it help if you change the error case to:
Err NotFound -> crash "impossible"

Problem is, the error case is reached because of the newline at the end of the file and Str.splitadds blank strings, e.g., Str.split "foo" "o" => ["f", "", ""].
I think this is expected behaviour since Haskell's splitOn also does this. I was actually looking for an equivalent of lines.

view this post on Zulip Paul Joubert (Jun 17 2024 at 15:55):

I see there's Str.trim, will try that in a moment.

view this post on Zulip Musab Nazir (Jun 17 2024 at 16:06):

An interesting observation:
You guys were getting 5-6s for the 1M files and even though my implementation (Dict Str) was virtually identical my M1 mac would finish consistently around 100-110s. After looking into it a lot, it came down to my input generation script. The one I was using, would create the correct number of rows, but it had waaay more unique cities. When I swapped the script to one linked on the official 1brc site I started getting ~5s. This input script only has 413 unique cities whereas my old one could go up to 44k which exacerbates the Dict issue I think

view this post on Zulip Paul Joubert (Jun 17 2024 at 16:07):

Trimming the newline at the end before splitting, and using crash on the nofound error:

Benchmark 1: ./main ../1brc-hs/measurements_1_000.txt
  Time (mean ± σ):       7.1 ms ±   0.3 ms    [User: 5.7 ms, System: 1.3 ms]
  Range (min … max):     6.4 ms …   8.4 ms    267 runs

Benchmark 2: ./main ../1brc-hs/measurements_10_000.txt
  Time (mean ± σ):      69.2 ms ±   0.7 ms    [User: 67.0 ms, System: 1.7 ms]
  Range (min … max):    68.0 ms …  71.8 ms    41 runs

Benchmark 3: ./main ../1brc-hs/measurements_100_000.txt
  Time (mean ± σ):     677.7 ms ±   4.1 ms    [User: 670.1 ms, System: 5.0 ms]
  Range (min … max):   672.5 ms … 686.2 ms    10 runs

Benchmark 4: ./main ../1brc-hs/measurements_1_000_000.txt
  Time (mean ± σ):      6.839 s ±  0.060 s    [User: 6.777 s, System: 0.029 s]
  Range (min … max):    6.755 s …  6.920 s    10 runs

I think this is within the margin of error of the prevous version.

view this post on Zulip Paul Joubert (Jun 17 2024 at 16:09):

@Musab Nazir oh interesting, and yeah, my test files were generated by the "official" python script.

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 18:53):

This input script only has 413 unique cities whereas my old one could go up to 44k which exacerbates the Dict issue I think

Yeah, the issue is linear with the size of the dict

view this post on Zulip Brendan Hansknecht (Jun 17 2024 at 18:53):

Basically, it is recursively updating all of the refcounts of the stored string keys

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 06:19):

This is a basic impl of 1brc using the latest roc (including #6911) and basic-cli: https://github.com/bhansconnect/roc-1brc/blob/main/1brc.roc

It's only single threaded, so obviously the performance can't be that good. I'm sure it also has tons of room for improvement (probably a lot within roc itself).

Perf result on the full 1 billion rows:

M1 mac 8GB ram: 145s

Linux machine 32GB ram: 272s


Assuming perfect parallelism over the 8 cores the test uses, we would expect something in the range of:
M1: 18s
Linux: 34s

For context, good results for the expanded key version of the test start at about 20s and go down as low as 3s. That said, good results also do a lot more tricks than we do (like swar for finding the semicolons).


Note: I used the python generator, which actually leads to slower results cause it uses 8,921 weather station names instead of just the 413 of the java generator.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 06:20):

Comparing to https://benhoyt.com/writings/go-1brc/

I am doing something roughly similar to r4. Which took 51s in golang.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 06:33):

To get a bit more of a direct comparison. This is the timing of r4 on my machines with my same input data:
M1: 66s
Linux: 73s

So roughly 2.2x slower currently on M1 and 3.7x slower on linux.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 06:35):

Perf of the roc code is split as such:
25% list.splitFirst
25% parsing the rest of row (probably the pattern match)
50% Dict.update

view this post on Zulip Richard Feldman (Jul 18 2024 at 12:12):

I wonder how much of that is algorithmic differences

view this post on Zulip Richard Feldman (Jul 18 2024 at 12:12):

not sure how splitFirst could be made significantly faster! :big_smile:

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 15:34):

What do you mean by algorithmic?

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 15:41):

Aside from using an mmap instead of buffered io, the two implementations are almost identical algorithmically. With the caveat that I use pattern matching while go uses manual but roughly equivalent code.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 16:00):

Also, at a minimum for split, it is doing a ton of redundant checks by calling List.sublist twice. But I would guess that iterating in roc with refcount, tags, and list checks probably costs a solid amount then a quick loop in zig.

Also, for bytes specifically, you can use simd or swar to find much faster.

view this post on Zulip Richard Feldman (Jul 18 2024 at 16:29):

Brendan Hansknecht said:

Also, at a minimum for split, it is doing a ton of redundant checks by calling List.sublist twice. But I would guess that iterating in roc with refcount, tags, and list checks probably costs a solid amount then a quick loop in zig.

oh interesting! do you think it would be viable to move more of that to zig to avoid redundant checks?

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 17:06):

Yeah, I think we could get a decent gain from moving it into zig.

Also, in this specific case, a bytes specific version could use simd or swar. Which would be even faster, but I don't think the go version does that.

view this post on Zulip Richard Feldman (Jul 18 2024 at 18:13):

I also like the idea of doing that!

view this post on Zulip Richard Feldman (Jul 18 2024 at 18:14):

working with bytes is super common

view this post on Zulip Richard Feldman (Jul 18 2024 at 18:14):

and having builtins just automatically run faster for common cases like that seems great

view this post on Zulip Richard Feldman (Jul 18 2024 at 18:14):

it's a good use of builtins :big_smile:

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 20:20):

Yeah, one note related to that, would also be great if we allowed for special ability specialization for lists of bytes. For example, hashing has addBytes, but you have to wrap a List U8 in an opaque type to use it. So List U8 hashing is slow by default.

I should actually double check and verify this

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 20:49):

Yeah, doubled checked, List U8 hashing is on the slow path by default. It hashes each element individually.

view this post on Zulip Richard Feldman (Jul 18 2024 at 20:55):

oh yeah we should totally do that one too

view this post on Zulip Richard Feldman (Jul 18 2024 at 20:56):

same with equality checks for sure

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 20:57):

yeah, and with equality checks, we can use treat any list of types without pointers (or opaques) as a big list of bytes and just blit over them instead of caring about padding.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 20:59):

Actually, i guess we could do that with hashing as well...maybe....depends if we want hashing a b and c to be equivalent to hashing [a,b,c]. Where a/b/c are variables of some random type alias.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:00):

Oh wait, list hashing is already different than just hashing the elements...so yeah, any type that can be seen as a bag of bits (memcpy safe in the llvm backend) could use way faster list hashing.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:00):

That said, I do think there is a problem here with ordering of passes. Abilities are generated when we don't know concrete types. These optimizations require knowing the concrete final type.

view this post on Zulip Richard Feldman (Jul 18 2024 at 21:04):

concrete as in monomorphized?

view this post on Zulip Richard Feldman (Jul 18 2024 at 21:05):

I'd be surprised if there wasn't some way to get that to work without restructuring things, although @Ayaz Hafiz would know better

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:09):

Oh, I bet we could just make hashList a low level. So the ability dispatches to hashList. Then the low level should know the concrete type and be able to dispatch farther.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:10):

Or at least something roughly like that. So kinda two steps.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:14):

Also, as much as I would love to keep dict in roc (and think it should long term be possible to make efficient in roc), we probably could get some large perf gains by moving dict to zig. (oh, and making sure we enable borrow inference for dict and set the same as list and str).

That said, it would be a pretty large project to move dict and may not be worth it for now.

view this post on Zulip Brendan Hansknecht (Jul 18 2024 at 21:32):

Do we have a location in the compiler pipeline after/during specialization that we can pick a lowlevel we generate? Thinking about trying to specialize before we get to the backends instead of needing to deal with the specialization in each of the backends. So backends would still just dispatch a function call to zig without any special wiring. I'm assuming this would just be a pass on mono ir. I assume there is already a location in the compiler to add something like this.

view this post on Zulip Richard Feldman (Jul 18 2024 at 22:07):

I believe so, although I'm not sure offhand where it is

view this post on Zulip Richard Feldman (Jul 18 2024 at 22:08):

actually I think the lowlevels all do that

view this post on Zulip Richard Feldman (Jul 18 2024 at 22:08):

so if you search for pattern matches on lowlevels you might find it (I'm on mobile right now)

view this post on Zulip Brendan Hansknecht (Jul 19 2024 at 01:00):

Also, ran this through the poop and definitely some heavy memory issues (I need to test with a large bufreader instead of an mmap). That may be a large part of the perf, but memory issues might be from something else (like mmap loading pages may be label as a cache miss but not actually be that expensive, not sure):

$ poop -d 1 "./r4 data/measurements_1_000_000_000.txt" "./1brc data/measurements_1_000_000_000.txt"
Benchmark 1 (3 runs): ./r4 data/measurements_1_000_000_000.txt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          87.3s  ± 3.30s     83.5s  … 89.5s           0 ( 0%)        0%
  peak_rss           4.15MB ±  227KB    3.89MB … 4.28MB          0 ( 0%)        0%
  cpu_cycles          276G  ±  958M      275G  …  277G           0 ( 0%)        0%
  instructions        471G  ±  175M      471G  …  471G           0 ( 0%)        0%
  cache_references   16.3G  ±  265M     16.0G  … 16.6G           0 ( 0%)        0%
  cache_misses       1.06M  ±  224K      870K  … 1.31M           0 ( 0%)        0%
  branch_misses      2.75G  ± 21.6M     2.74G  … 2.77G           0 ( 0%)        0%
Benchmark 2 (3 runs): ./1brc data/measurements_1_000_000_000.txt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           169s  ± 10.1s      159s  …  180s           0 ( 0%)        💩+ 94.1% ± 19.6%
  peak_rss           16.0GB ± 74.5KB    16.0GB … 16.0GB          0 ( 0%)        💩+385274.2% ±  9.2%
  cpu_cycles          640G  ± 2.13G      638G  …  642G           0 ( 0%)        💩+131.9% ±  1.4%
  instructions       1.40T  ± 1.59G     1.40T  … 1.40T           0 ( 0%)        💩+197.5% ±  0.5%
  cache_references   10.9G  ±  133M     10.8G  … 11.0G           0 ( 0%)        ⚡- 33.0% ±  2.9%
  cache_misses        462M  ± 1.43M      460M  …  463M           0 ( 0%)        💩+43454.8% ± 218.7%
  branch_misses      2.95G  ± 30.2M     2.92G  … 2.97G           0 ( 0%)        💩+  7.4% ±  2.2%

Highlights ~45x as many cache misses and ~3x as many instructions.

view this post on Zulip Brendan Hansknecht (Jul 19 2024 at 01:01):

Also, totally possible all the cache misses are from our dictionary being in two different allocation along with the memory pressure from the giant mmap.

view this post on Zulip Brendan Hansknecht (Jul 19 2024 at 01:02):

Also, note, these times are way faster than the times above due to not clearing the page cache between runs. So the file was already in ram more or less.

view this post on Zulip Brendan Hansknecht (Jul 19 2024 at 06:39):

So with a better version of readline, perf is even worse than with the mmap. This version of readline actually accumulates to a RocList directly instead of a Vec (which helped a lot, but not enough):

Benchmark 1 (3 runs): ./r4 data/measurements_1_000_000_000.txt
  wall_time          77.6s  ± 4.87s     74.5s  … 83.2s           0 ( 0%)        0%
Benchmark 2 (3 runs): ./1brc-mmap data/measurements_1_000_000_000.txt
  wall_time           177s  ± 7.44s      168s  …  182s           0 ( 0%)        💩+127.7% ± 18.4%
Benchmark 3 (3 runs): ./1brc-readlines data/measurements_1_000_000_000.txt
  wall_time           255s  ± 2.91s      252s  …  258s           0 ( 0%)        💩+228.0% ± 11.7%

Last updated: Jul 06 2025 at 12:14 UTC