Stream: beginners

Topic: Dataframe Library


view this post on Zulip Luke Boswell (Jun 05 2023 at 10:24):

I was talking to a friend about roc-pg and the topic came up that the world needs a type-safe data frame library. Tbh I've never really used a data frame library (e.g. Python's pandas), so this question may not make a lot of sense. But from what was explained to me, it sounded like a use case that Roc could be well suited for.

Is there anything fundamental that would prevent Roc from having a really ergonomic data frame experience in future?

view this post on Zulip Luke Boswell (Jun 05 2023 at 10:25):

I guess the basic idea is an opaque type that stores 2D data and supports operations like grouping, filtering, etc. to produce new data frames.
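To make that idea concrete, here is a minimal sketch in Python (not Roc). `Frame`, `filter`, and `group_by` are hypothetical names invented for illustration, and the row-of-dicts storage is an assumption, not how any existing library works. It only shows that each operation returns a new frame and leaves the original untouched.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


@dataclass(frozen=True)
class Frame:
    """Hypothetical immutable 2D table: a tuple of rows sharing the same keys."""
    rows: Tuple[Row, ...]

    def filter(self, keep: Callable[[Row], bool]) -> "Frame":
        # Build a new Frame from the rows that pass the predicate.
        return Frame(tuple(r for r in self.rows if keep(r)))

    def group_by(self, column: str) -> Dict[Any, "Frame"]:
        # Split the rows into one new Frame per distinct value of `column`.
        groups: Dict[Any, List[Row]] = {}
        for r in self.rows:
            groups.setdefault(r[column], []).append(r)
        return {key: Frame(tuple(rs)) for key, rs in groups.items()}


# Made-up example data.
sales = Frame((
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 25},
    {"region": "north", "amount": 40},
))
big = sales.filter(lambda r: r["amount"] > 20)
by_region = sales.group_by("region")
```

Nothing here is type-safe per column; it is only the "operations produce new data frames" part of the idea.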

view this post on Zulip Richard Feldman (Jun 05 2023 at 10:36):

Luke Boswell said:

I've never really used a data frame library (e.g. Python's pandas)

same here...I'm curious what others think! :big_smile:

view this post on Zulip Anton (Jun 05 2023 at 11:38):

the world needs a type-safe data frame library

Yes, I've thought this too back when I was working a lot with pandas.
This also seems like a good fit for plugins.

The way I imagine that this could work is that you have an LLM detect appropriate types for each column and have it generate custom parsers where necessary to convert the strings to those types. A plugin could be used to handle the data that fails to parse, it could be corrected, labeled as missing, or the whole row could be deleted.
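The failure-handling part of that idea can be sketched without any LLM or plugin system. In this hedged Python sketch, `parse_column`, `mark_missing`, and `drop_row` are hypothetical names, and the "plugin" is assumed to be just a callback that receives the failing row and returns either a repaired row or `None` to delete it.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def parse_int(text: str) -> Optional[int]:
    # Return None instead of raising when the text is not an integer.
    try:
        return int(text.strip())
    except ValueError:
        return None


def mark_missing(row: Row, column: str) -> Optional[Row]:
    # Policy 1: keep the row, but label the value as missing.
    return {**row, column: None}


def drop_row(row: Row, column: str) -> Optional[Row]:
    # Policy 2: delete the whole row.
    return None


def parse_column(
    rows: List[Row],
    column: str,
    parser: Callable[[str], Optional[int]],
    on_failure: Callable[[Row, str], Optional[Row]],
) -> List[Row]:
    out: List[Row] = []
    for row in rows:
        value = parser(str(row[column]))
        if value is not None:
            out.append({**row, column: value})
            continue
        # Parsing failed: let the policy decide what happens to this row.
        fixed = on_failure(row, column)
        if fixed is not None:
            out.append(fixed)
    return out


# Made-up example data.
raw = [{"age": "31"}, {"age": "n/a"}, {"age": " 7 "}]
with_missing = parse_column(raw, "age", parse_int, mark_missing)
without_bad = parse_column(raw, "age", parse_int, drop_row)
```

A "correct it" policy would be a third callback that returns a repaired row.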

pandas does not make good use of the available hardware: it's not multi-threaded, even though dataframe operations are often embarrassingly parallel. pandas also does not handle the available memory intelligently: it will gladly spend a long time filling up your memory when it could easily predict that the dataframe it's constructing from a file will not fit.

With the potential for speed and efficiency in Roc, as well as our attention to UX, we could indeed make an excellent dataframe library.

view this post on Zulip Bryce Miller (Jun 05 2023 at 12:18):

I’ve used dataframes in R. I agree that Roc seems well suited to this use case.

view this post on Zulip Bryce Miller (Jun 05 2023 at 12:19):

The FP features in R were really nice for this. Namely: immutability and pipes.

view this post on Zulip Bryce Miller (Jun 05 2023 at 12:27):

And if we could have tight editor integration that would also be killer.

view this post on Zulip Ajai Nelson (Jun 05 2023 at 12:36):

Yes, I totally agree! If people want to learn about data frames, I recommend looking at dplyr, which is very popular in R and has a nice functional API. (R has its quirks, but I think people underestimate it.)

Also, here’s a paper I found interesting that describes some of the challenges of typing data frames. They made a “benchmark for table types” that includes common data frame operations so people can compare how well their language can support typing them.

view this post on Zulip Joe Giralt (Jun 05 2023 at 16:42):

Elixir has a pretty nice, functional dataframe library.
Explorer took a lot from dplyr.
https://hexdocs.pm/explorer/Explorer.html

view this post on Zulip Brendan Hansknecht (Jun 05 2023 at 17:36):

One thing I find interesting here is that a lot of data frames are noisy in terms of types. That is what makes them hard to work with. Naturally, I think dynamic languages are well suited in that they can often flexibly do things with data of different types. That said, they often do the wrong thing or crash due to wrong types. A dynamic language that properly had hooks to help you deal with wrong types would probably be very well suited.

That said, with a simple tag union type and doing the work to wrap everything, a typed language can do quite well. I think it has some wins and some losses. The obvious win is that it forces you to deal with all possible types. The obvious loss is that dealing with all types is a pain and often a big loss in terms of perf. A JITed dynamic language can optimize for the single expected type and essentially recover via exceptions if it hits an unexpected type. That can be a huge perf gain if your data through certain functions is mostly regular.

I feel like true messy data frames are a case where you both do and don't want types. But this is from pretty old and limited experience.
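The "tag union and wrap everything" approach looks roughly like this in Python. This is a sketch under stated assumptions: `IntCell`, `TextCell`, `Missing`, and `total` are hypothetical names, and Python does not enforce at runtime that every variant is handled. A typed language with tag unions would be expected to check that at compile time.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class IntCell:
    value: int


@dataclass(frozen=True)
class TextCell:
    value: str


@dataclass(frozen=True)
class Missing:
    pass


# Every cell in a noisy column is wrapped in one of these variants.
Cell = Union[IntCell, TextCell, Missing]


def total(cells: List[Cell]) -> Tuple[int, int]:
    """Sum the cells that hold a number; also count the ones that were skipped."""
    running, skipped = 0, 0
    for cell in cells:
        if isinstance(cell, IntCell):
            running += cell.value
        elif isinstance(cell, TextCell) and cell.value.strip().isdigit():
            # Text that happens to hold digits is converted explicitly.
            running += int(cell.value.strip())
        else:
            # Non-numeric text and Missing both end up here.
            skipped += 1
    return running, skipped


# Made-up noisy column.
column = [IntCell(3), TextCell("4"), TextCell("oops"), Missing()]
result = total(column)
```

The per-cell branching is the cost described above: each value is inspected before use, where a JIT could specialize for the common case.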

view this post on Zulip Luke Boswell (Jun 06 2023 at 06:28):

Thank you for these ideas. Looks like I have some research to do :laughing:

view this post on Zulip Hannes (Jun 06 2023 at 11:50):

I've also thought about what a dataframes library would look like in Roc. I don't think there's been a dataframes library that takes advantage of static types yet, but I think there are reasons why it hasn't been attempted/succeeded.

I wrote a long and rambling thing about dataframe libraries here, read at your own peril

Having said all that, I would still like to see what a dataframes library in Roc would look like. The key thing I'd like is to take advantage of Roc's type checker to typecheck the types of the columns. I believe this is something that not even polars does. The only example I know of is Julia's TypedTables.jl, but Julia can't be statically type checked, so it doesn't take advantage of this fact.

I thought about trying to rewrite one of my analyses in Roc using a list of records as a kind of basic dataframe to see how far I could get, but the first step in this analysis was to loop over every column to rename it, which wouldn't be possible with that data structure. Instead, this morning I looked at the API for OCaml's Owl and started writing my own library in Roc. I made a Dataframe type which is an opaque wrapper for a list of series. A series is a union of a bunch of different series types, and each series type is an opaque wrapper around a list and a string for the series' name. I used a Python script to generate each series type. I gave up after trying to write an example app using the library I wrote and getting some compiler errors that I didn't understand.

Here's the repo for my experiment: https://github.com/Hasnep/roc-dataframes
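The structure described in this message (a dataframe wrapping a list of series, where a series is a union of typed series and each carries a name plus a list) can be sketched in Python as follows. This is not taken from the linked repo: `IntSeries`, `StrSeries`, `Dataframe`, and `rename_all` are hypothetical names, and only two series types are shown. It illustrates why this layout allows looping over columns to rename them, which a list of records does not.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class IntSeries:
    name: str
    values: Tuple[int, ...]


@dataclass(frozen=True)
class StrSeries:
    name: str
    values: Tuple[str, ...]


# A series is one of several typed variants.
Series = Union[IntSeries, StrSeries]


@dataclass(frozen=True)
class Dataframe:
    columns: Tuple[Series, ...]

    def names(self) -> List[str]:
        return [s.name for s in self.columns]

    def rename_all(self, f: Callable[[str], str]) -> "Dataframe":
        # Column names are ordinary data here, so renaming is a plain loop.
        # `replace` copies each series with a new name and keeps its variant.
        return Dataframe(tuple(replace(s, name=f(s.name)) for s in self.columns))


# Made-up example data.
df = Dataframe((
    IntSeries("Order ID", (1, 2)),
    StrSeries("Customer Name", ("Ann", "Bo")),
))
renamed = df.rename_all(lambda n: n.lower().replace(" ", "_"))
```

The trade-off is that the column types live in runtime values, not in the type of the dataframe, so a type checker cannot tell which columns exist or what they hold.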

view this post on Zulip Hannes (Jun 06 2023 at 12:07):

Oh, my brother just pointed out to me that implementing the Apache Arrow spec would probably be the best way of writing a dataframe library in Roc. But I believe Arrow's dataframes are dynamically typed (that's why polars doesn't typecheck the columns), so I think the experience of using Arrow in Roc would be just like a more verbose version of Elixir's Explorer.

view this post on Zulip Anton (Jun 06 2023 at 13:11):

It's so incredibly inconsistent and unergonomic, I have to look up the syntax for every operation each time I use it.

:100:

view this post on Zulip Anton (Jun 06 2023 at 13:13):

Tangentially related; data analysis with chatGPT:
https://www.youtube.com/watch?v=b9hSCuFGNRU

view this post on Zulip Ajai Nelson (Jun 06 2023 at 15:47):

@Hannes Your comment makes a lot of sense to me. Just to add to the part about it being really hard, my vague understanding is that type-safe data frame libraries are still an open research problem in a lot of ways. I suspect a really good one would require a more powerful type system than Roc currently has.

Also, I don’t have any real experience using pandas, but I’m so glad to hear others confirm my feelings about its API! I don’t understand how it hasn’t been replaced by a nicer wrapper or something. Don’t even get me started on matplotlib (but again, not much experience).

view this post on Zulip Bryce Miller (Jun 06 2023 at 21:57):

Even if Roc had a dataframe library that didn't provide any typing advantages over the alternatives in other languages, I still think it would be better than not having one at all. I didn't use dplyr for data science stuff at all. I just used it to help me ingest product data at my e-commerce job because our PIM was terrible. So if someone were building an e-commerce platform, ERP, BI tool, etc., they could probably take advantage of such a library. And it would probably be painful to have to maintain a custom Roc platform just to gain access to fast data transformations. But perhaps for those use cases a custom platform would make sense anyway.

view this post on Zulip Bryce Miller (Jun 06 2023 at 22:00):

So projects using Roc because it's a good fit, that also happen to need to do some data transformation and analysis.

view this post on Zulip Adam Forbes (Jun 09 2023 at 13:08):

The Elixir library mentioned uses Polars https://www.pola.rs/ under the hood (a Rust library)


Last updated: Jul 05 2025 at 12:14 UTC