Stream: beginners

Topic: Reading a local file in a stream


view this post on Zulip Karakatiza (May 06 2024 at 21:07):

Is there currently a reasonable cross-platform way to read a local file in chunks, e.g. with basic-cli Tcp.readLine?
On Ubuntu I plan to use the netcat program to serve the file and read it as a Tcp stream, but I can't find an analogous built-in tool for, e.g., Windows.

view this post on Zulip Richard Feldman (May 06 2024 at 22:40):

so like Tcp.readLine but for file I/O instead?

view this post on Zulip Karakatiza (May 06 2024 at 22:42):

Yes, essentially - to be able to handle large files
I realized in my case I probably won't get bottlenecked by the current strict File.readUtf8 for now, but this is a use-case to consider

view this post on Zulip Anton (May 07 2024 at 13:20):

I made https://github.com/roc-lang/basic-cli/issues/205 for this

view this post on Zulip Karakatiza (May 08 2024 at 01:52):

@Musab Nazir if you are on Linux, you could try to combine serving your file with netcat and stream process it using the Tcp builtin module
Maybe https://gist.github.com/talwai/d94c71ca09729ac655b4 will be of help; you can execute it with the Cmd builtin module

view this post on Zulip Luke Boswell (May 08 2024 at 03:27):

What if we just add a File.readLine similar to Rust's std::io::BufRead read_line method?
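For readers unfamiliar with the Rust method mentioned here, the following is a minimal sketch of line-by-line reading with `BufRead::read_line`. The function names `count_lines` and `count_lines_in_file` are hypothetical and only illustrate the pattern; this is not the basic-cli implementation.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Reads `reader` one line at a time and returns (bytes_read, lines_read).
/// Only the current line is held in memory, never the whole input.
pub fn count_lines<R: BufRead>(mut reader: R) -> io::Result<(usize, usize)> {
    let mut line = String::new();
    let (mut bytes, mut lines) = (0, 0);
    loop {
        line.clear(); // read_line appends, so reset the buffer each iteration
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // Ok(0) signals end of input
        }
        bytes += n;
        lines += 1;
    }
    Ok((bytes, lines))
}

/// Hypothetical convenience wrapper: open a file and stream its lines.
pub fn count_lines_in_file(path: &Path) -> io::Result<(usize, usize)> {
    count_lines(BufReader::new(File::open(path)?))
}
```

Because `count_lines` accepts any `BufRead`, the same loop works for a file, an in-memory byte slice, or a buffered socket.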

view this post on Zulip Luke Boswell (May 08 2024 at 03:27):

Then you can just write a task to process lines and call it using Task.loop

view this post on Zulip Luke Boswell (May 08 2024 at 03:30):

Maybe you call File.openBuffered to get a handle first?

view this post on Zulip Karakatiza (May 08 2024 at 03:32):

That would be ideal
With File.readLine, File.readUpTo and File.readExactly to match Tcp builtin
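As a rough sketch of what these three operations could map to on the Rust side, the helpers below use only the standard library. The Rust function names are hypothetical, and `read_up_to` here keeps reading until the limit or end of input, which may differ from how the Tcp module behaves.

```rust
use std::io::{self, BufRead, Read};

/// Reads one line, including its trailing newline if present.
/// Returns an empty string at end of input.
pub fn read_line<R: BufRead>(r: &mut R) -> io::Result<String> {
    let mut s = String::new();
    r.read_line(&mut s)?;
    Ok(s)
}

/// Reads at most `max` bytes; returns fewer only when the input ends first.
pub fn read_up_to<R: Read>(r: &mut R, max: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    r.by_ref().take(max).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Reads exactly `n` bytes, or fails with `UnexpectedEof` if fewer remain.
pub fn read_exactly<R: Read>(r: &mut R, n: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; n];
    r.read_exact(&mut buf)?;
    Ok(buf)
}
```

All three take a generic reader, so they apply equally to a `BufReader<File>` and a buffered TCP stream.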

I can't contribute as I am not familiar with Rust

view this post on Zulip Karakatiza (May 08 2024 at 03:35):

Another idea is maybe these functions can be unified between File and Tcp modules with a single Stream module

view this post on Zulip Richard Feldman (May 08 2024 at 03:43):

I have a design somewhere for a general Stream concept

view this post on Zulip Richard Feldman (May 08 2024 at 03:43):

maybe now's the time to implement it! :smiley:

view this post on Zulip Luke Boswell (May 08 2024 at 05:05):

Thought I might have a go and see how to implement it. It works, but not sure this is the right direction.

https://github.com/roc-lang/basic-cli/pull/206

$ roc examples/file-read-buffered.roc
🔨 Rebuilding platform...
Read 119 bytes
Read 1 bytes
Read 52 bytes
Read 1 bytes
Read 496 bytes
Read 1 bytes
Read 22 bytes
Read 1 bytes
Read 194 bytes
Read 1 bytes
Read 330 bytes
Read 1 bytes
Read 52 bytes
Read 1 bytes
Read 181 bytes
Read 1 bytes
Read 461 bytes
Done reading, got {bytesRead: 1915, linesRead: 17}

view this post on Zulip Luke Boswell (May 08 2024 at 05:05):

I'm using this to keep track of the buffered file reader

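use std::{cell::RefCell, fs::File, io::BufReader, rc::Rc};

// Per-thread registry of the open buffered file readers.
thread_local! {
    static READERS: RefCell<Vec<Rc<RefCell<BufReader<File>>>>> = RefCell::new(Vec::new());
}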
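One way such a registry could be used is sketched below, assuming the handle passed to Roc is simply the reader's index in the `Vec`. The functions `open_buffered` and `read_line` are hypothetical illustrations and may not match what the PR actually does.

```rust
use std::cell::RefCell;
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;
use std::rc::Rc;

thread_local! {
    static READERS: RefCell<Vec<Rc<RefCell<BufReader<File>>>>> = RefCell::new(Vec::new());
}

/// Opens `path`, stores a buffered reader in the registry, and returns its
/// index as an opaque handle (assumption: handles are plain indices).
pub fn open_buffered(path: &Path) -> io::Result<usize> {
    let reader = Rc::new(RefCell::new(BufReader::new(File::open(path)?)));
    Ok(READERS.with(|readers| {
        let mut readers = readers.borrow_mut();
        readers.push(reader);
        readers.len() - 1
    }))
}

/// Reads the next line (newline included) from the reader behind `handle`.
/// An empty result means end of file.
pub fn read_line(handle: usize) -> io::Result<Vec<u8>> {
    let reader = READERS
        .with(|readers| readers.borrow().get(handle).cloned())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "unknown reader handle"))?;
    let mut line = Vec::new();
    reader.borrow_mut().read_until(b'\n', &mut line)?;
    Ok(line)
}
```

This sketch never removes readers from the registry, so files stay open for the life of the thread; a real platform would also need a way to close a handle.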

view this post on Zulip Musab Nazir (May 08 2024 at 11:21):

I have another question tangentially related to this. When we use the import syntax to load a file like:

import "sample.txt" as sample : Str

Is it the same as using File.read or is there extra stuff roc is doing in the import version to the file?

view this post on Zulip Anton (May 08 2024 at 11:36):

I believe this causes the file to be read at compile time and included in the executable but I'd have to check to make sure.

view this post on Zulip Musab Nazir (May 08 2024 at 14:23):

Apologies if this isn't the right place for these questions but I'm trying to find the fastest way I can read the raw contents of a file and my current method is showing non-linear runtime as I feed bigger and bigger files.

main =
    startTime = Utc.now!
    input <- File.readBytes (Path.fromStr "measurements.txt") |> Task.attempt
    when input is
        Ok _ ->
            endTime = Utc.now!
            runTime = Utc.deltaAsMillis startTime endTime |> Num.toStr
            Stdout.line! "File read in $(runTime)ms"
        Err _ -> Stdout.line! "Failed to read"

With a 100mil row file (1.5gb) my fastest run was 800ms on the dot
With a 200mil row file (3.0gb) I get 5500ms, I might be running into memory limitations cuz I'm on a 8gig m1 machine.

Is there any low hanging fruit in terms of cutting this time down more?

view this post on Zulip Brendan Hansknecht (May 08 2024 at 14:24):

There is, 100%

view this post on Zulip Brendan Hansknecht (May 08 2024 at 14:24):

Pretty sure we are duplicating the read in file

view this post on Zulip Brendan Hansknecht (May 08 2024 at 14:25):

Also, if we add something for streamed reading, please make sure it has either a configurable or a very large buffer (4k minimum, 16k probably better)
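A configurable buffer is straightforward with Rust's `BufReader::with_capacity`. The sketch below assumes a 16 KiB default, following the suggestion above; `open_with_capacity`, `count_bytes`, and `DEFAULT_CAPACITY` are hypothetical names.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Assumed default, taken from the 16k suggestion in this thread.
pub const DEFAULT_CAPACITY: usize = 16 * 1024;

/// Opens `path` with a caller-chosen buffer size, or the default if `None`.
pub fn open_with_capacity(path: &Path, capacity: Option<usize>) -> io::Result<BufReader<File>> {
    let file = File::open(path)?;
    Ok(BufReader::with_capacity(capacity.unwrap_or(DEFAULT_CAPACITY), file))
}

/// Streams the reader through a fixed-size chunk and returns the bytes seen.
pub fn count_bytes(mut reader: impl Read) -> io::Result<u64> {
    let mut chunk = [0u8; 4096];
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            return Ok(total); // end of file
        }
        total += n as u64;
    }
}
```

A larger buffer means fewer system calls for the same amount of data, at the cost of a little more memory per open reader.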

view this post on Zulip Brendan Hansknecht (May 08 2024 at 14:27):

Though personally, I would push for a mmap'ed load directly into a roc list instead. Fast and nice. Don't have to think about buffers at all. Just one magical block (though it may not play nice with roc and will explode on the first edit). It would need to be specially treated as read-only or managed by the platform with more effects.

view this post on Zulip Luke Boswell (May 08 2024 at 14:33):

Can we pass a slice into Roc?

view this post on Zulip Luke Boswell (May 08 2024 at 14:36):

"I'm trying to find the fastest way I can read the raw contents of a file and my current method is showing non-linear runtime as I feed bigger and bigger files."

One thing to point out is that I think you will see limitations with basic-cli that are probably just due to the platform, which is pretty primitive, and not necessarily anything specific to roc. So making a fork, or writing another platform implementation that is better suited, may be a good idea to consider.

view this post on Zulip Luke Boswell (May 08 2024 at 14:37):

Depending on what it is you're trying to do.

view this post on Zulip Musab Nazir (May 08 2024 at 14:57):

"One thing to point out is that I think you will see limitations with basic-cli that are probably just due to the platform, which is pretty primitive, and not necessarily anything specific to roc. So making a fork, or writing another platform implementation that is better suited, may be a good idea to consider."

ah right I forget I'm going through a platform. Might look into what basic-cli is doing and see if a local tweak gets me what I'm after.

view this post on Zulip Brendan Hansknecht (May 08 2024 at 15:01):

We can pass slices into roc.

view this post on Zulip Richard Feldman (May 08 2024 at 15:29):

I think we should really avoid mmap in Roc platform I/O...it would introduce the characteristic "any other process can now cause your Roc application code to perform undefined behavior" whereas you don't have that concern if there's no mmap primitive

view this post on Zulip Brendan Hansknecht (May 08 2024 at 17:30):

I don't follow. Just mmap private.

view this post on Zulip Richard Feldman (May 08 2024 at 17:34):

whoa, TIL that MAP_PRIVATE protects against that! :mind_blown:

view this post on Zulip Richard Feldman (May 08 2024 at 17:35):

ok in that case I guess there's no potential for UB?

view this post on Zulip Brendan Hansknecht (May 08 2024 at 17:41):

Of course, on modification, mmapped slices have an issue in roc. They will be cloned and the original map won't be freed.... Though maybe it could be intercepted if done right....

view this post on Zulip Richard Feldman (May 08 2024 at 17:43):

what if we only offered readonly access?

view this post on Zulip Richard Feldman (May 08 2024 at 17:44):

like the platform just didn't offer write access to any file that's mmap'd

view this post on Zulip Brendan Hansknecht (May 08 2024 at 18:36):

Yeah, would probably do that by offering a seamless slice to the mmap'ed file

view this post on Zulip Brendan Hansknecht (May 08 2024 at 18:36):

Then, if the user tries to write, it would copy the entire thing.
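The read-only-until-written behavior described here resembles copy-on-write. As a loose analogy only (this is not Roc's seamless-slice mechanism and involves no real mmap), Rust's `std::borrow::Cow` shows the same shape: a borrowed view is shared until the first write, which clones the whole buffer. The helper names are hypothetical.

```rust
use std::borrow::Cow;

/// Wraps borrowed bytes as a read-only view. The slice stands in for memory
/// that a platform mapped read-only (assumption: no real mmap is used here).
pub fn view(mapped: &[u8]) -> Cow<'_, [u8]> {
    Cow::Borrowed(mapped)
}

/// Overwrites one byte. The first write to a borrowed view clones the entire
/// buffer into an owned Vec; later writes reuse that owned copy.
pub fn set_byte(bytes: &mut Cow<'_, [u8]>, index: usize, value: u8) {
    bytes.to_mut()[index] = value;
}

/// True while the view still points at the original backing memory.
pub fn is_shared(bytes: &Cow<'_, [u8]>) -> bool {
    matches!(bytes, Cow::Borrowed(_))
}
```

Reads never copy, so a program that only scans the data pays nothing extra; the full copy happens once, on the first modification, and the backing bytes are left untouched.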


Last updated: Jul 06 2025 at 12:14 UTC