✔ Change List.split to List.splitAt? · ideas

Hey all, there is an issue #6164 that brings up that Str.split and List.split behave somewhat differently and proposes changing List.split to possibly List.splitAt and then introducing a List.split that more closely resembles Str.split. Does anyone have any strong opinions when it comes to this? I personally like the idea of changing List.split.

Isaac Van Doren (Dec 08 2023 at 03:31):

Anton (Dec 08 2023 at 10:16):

I like the rename for the current List.split to List.splitAt.
For the new List.split implementation, I prefer the List.split : List elem, elem -> List (List elem) signature.

Anton (Dec 08 2023 at 11:22):

Richard Feldman (Dec 08 2023 at 11:56):

Anton (Dec 08 2023 at 12:01):

Declan Joseph Maguire (Dec 08 2023 at 12:32):

A little bit of ambiguity in the name for me, is whether the split happens before or after the matching spot. I always find myself consulting the docs for things like this if I've been away from a language for a while.

LoipesMas (Dec 08 2023 at 12:35):

Maybe this could also use tags, e.g., List.splitAt (After 5). But this doesn't "sound" that nice

Declan Joseph Maguire (Dec 08 2023 at 12:37):

Maybe generic List.split could work with that sort of structure, with more specific function names likeList.splitAfter or List.splitBy not having the tag argument as normal

Brendan Hansknecht (Dec 08 2023 at 15:53):

SplitBy suggest removal of the element. I don't think SplitOn suggests removal

Brendan Hansknecht (Dec 08 2023 at 15:54):

If I "split by" whitespace, I wouldn't expect to see whitespace in the output. If I "split on" whitespace, I would expect it to be captured in a trailing manner with each split.

Brendan Hansknecht (Dec 08 2023 at 15:55):

mdznta (Dec 08 2023 at 23:05):

Sky Rose (Dec 09 2023 at 02:01):

Isaac Van Doren (Dec 09 2023 at 03:07):

Isaac Van Doren (Dec 09 2023 at 03:08):

Whichever way it goes, it would be great if the comparable Str and List functions have the same name

Sky Rose (Dec 09 2023 at 16:19):

I just thought of a new complication:
For each split function, we probably want a version that takes an elem and a version that takes a predicate.
List elem, elem -> List (List elem)
List elem, (elem -> Bool) -> List (List elem)

Anton (Dec 10 2023 at 12:19):

That seems reasonable at first, but I'm struggling to come up with common enough realistic use cases :thinking:

Anton (Dec 12 2023 at 11:47):

@Richard Feldman can you make the final decision on splitBy vs splitOn? That way @mdznta can move forward.

Adrian (Apr 11 2024 at 21:06):

I'd like to add a first attempt at parsing multipart/form-data and since basic-webserver would probably prefer not to have List splitting code in the Http module it would be incredibly convenient to move forward with this issue.
If it counts towards anything my vote is on

List.splitAt: List a, U64 -> {before: List a, after: List a}
List.splitOn: List a, List a -> List (List a) where a implements Eq

Richard Feldman (Apr 11 2024 at 23:01):

oh yeah I just realized I never responded to this! I'm in favor of that pairing as well :+1:

Brendan Hansknecht (Apr 11 2024 at 23:01):

So an important distinction here in @Adrian's version is that List.splitOn matches Str.split. It is skipping entire sequences of elements instead of just a single element.

Brendan Hansknecht (Apr 11 2024 at 23:01):

Which I think is fine, but just an important difference from what was talked about above.

Richard Feldman (Apr 11 2024 at 23:02):

Brendan Hansknecht (Apr 11 2024 at 23:38):

List.splitAt: List a, U64 -> {before: List a, after: List a}
List.splitOn: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)

Kevin Gillette (Apr 12 2024 at 00:36):

Brendan Hansknecht (Apr 12 2024 at 01:10):

List.splitAt : List a, U64 -> {before: List a, after: List a}
List.splitOn : List a, [Elem a, Sequence (List a)] -> List (List a)

Isaac Van Doren (Apr 12 2024 at 01:55):

I would prefer just passing a single element list if you only want to split on an element. It seems nicer than requiring a tag or having two different functions

Brendan Hansknecht (Apr 12 2024 at 01:59):

I personally don't like that cause I really dislike the chains of minor paper cuts that aren't likely to optimize away in api design. Generating a list for a single element then extracting the element to run the more optimized single element split function will not optimize away. There will always be an unnecessary allocation.

Brendan Hansknecht (Apr 12 2024 at 01:59):

The tag, should optimize away when the outer function inlines (which should essentially always happen). Even if it doesn't optimize away, no allocation, just an extra int in a register/on stack.

Brendan Hansknecht (Apr 12 2024 at 02:00):

With strings it matters a lot less cause we have small string optimization (which at a minimum avoids the allocation and likely also optimizes away). We have no plans to add a small list optimization (it is a much more complex mess due to various sized elements and often requires manual tuning to be beneficial).

Isaac Van Doren (Apr 12 2024 at 02:04):

Brendan Hansknecht (Apr 12 2024 at 02:15):

Anyway, doesn't matter to me if it is a tag or 2 functions, but I think it is 100% fine for someone to start implementing this. It would be a very minor tweak to switch from one version to the other. So they can write up 99% of the PR rn.

Brendan Hansknecht (Apr 12 2024 at 02:27):

Oh, also, I just had a realization. The fancy algorithm that gets applied with strings can't be applied on lists. You would have to explicitly know the list is a List U8 to use it. As such, I think the linear scan algorithm will be the fastest thing we can safely do.

So I think the impl I wrote here for splitting based on a seq is essentially a valid impl. That said, if written in the standard library, we can rewrite it in a smarter way to avoid more bounds checking. So that will lead to a different design if we want to write the max perf version. I think this should be performant as 100% roc builtin code (bar some future refcounting fixes, but those will come with time and don't need to be worried about now).

Adrian (Apr 12 2024 at 07:35):

Should I also add the appropriate First and Last functions?
This is from my current Utils.roc so names are still subject to change.

listSplitFirstSeq : List a, List a -> { before : List a, after : List a } where a implements Eq
listSplitFirstSeq = \list, seq ->
    if List.isEmpty seq then
        { before: [], after: list }
    else
        listSplitFirstSeqHelper list seq 0 0

listSplitFirstSeqHelper = \list, seq, i, j ->
    when (List.get list (i + j), List.get seq j) is
        (Ok x, Ok y) if x == y ->
            listSplitFirstSeqHelper list seq i (j + 1)

        (Ok _, Ok _) ->
            listSplitFirstSeqHelper list seq (i + 1) 0

        (Ok _, Err _) ->
            {
                before: List.sublist list { start: 0, len: i },
                after: List.sublist list { start: i + j, len: (List.len list) - i - j },
            }

        (Err _, Ok _) ->
            { before: List.sublist list { start: 0, len: i + j }, after: [] }

        (Err _, Err _) ->
            { before: List.sublist list { start: 0, len: i }, after: [] }

expect listSplitFirstSeq ("a/b/c" |> Str.toUtf8) ['/'] |> eq { before: "a" |> Str.toUtf8, after: "b/c" |> Str.toUtf8 }
expect listSplitFirstSeq ("/a/b/c" |> Str.toUtf8) ['/'] |> eq { before: [], after: "a/b/c" |> Str.toUtf8 }
expect listSplitFirstSeq ("abc/" |> Str.toUtf8) ['/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }

expect listSplitFirstSeq ("a//b//c" |> Str.toUtf8) ['/', '/'] |> eq { before: "a" |> Str.toUtf8, after: "b//c" |> Str.toUtf8 }
expect listSplitFirstSeq ("//a//b//c" |> Str.toUtf8) ['/', '/'] |> eq { before: [], after: "a//b//c" |> Str.toUtf8 }
expect listSplitFirstSeq ("abc//" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("/abc/" |> Str.toUtf8) ['/', '/'] |> eq { before: "/abc/" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }

Anton (Jun 29 2024 at 11:58):

Someone commented on the github issue for this (#6164), let's finish up the decision here.

List.splitOn: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)

List.splitOn : List a, [Elem a, Sequence (List a)] -> List (List a)

Anton (Jun 29 2024 at 11:58):

I'm heavily in favor of 1. it's just simpler. It doesn't require you to be familiar with tag unions, it's easier to type. I can't find any real benefits of 2 :p

Francois Green (Jun 29 2024 at 14:58):

Version 2 is far more elegant and once you abbreviate Sequence to Seq everyone will see it :laughing:

Adrian (Jun 29 2024 at 15:44):

Sorry to restir this conversation, but aren't the roc's compiler and llvm able to optimize out List.splitOnSeq _ [element] to List.splitOn? Could we get away with only one function like in the Str module?

Anton (Jun 29 2024 at 16:07):

Anton (Jun 29 2024 at 16:14):

That optimization seems plausible, I think it's safe to consider the performance of both equivalent.

We definitely could get away with it, can you share why you think it's the best approach?

Brendan Hansknecht (Jun 29 2024 at 17:03):

Definitely not. That list requires an allocation. LLVM basically gives up on everything once the heap is involved (not actaully true, but not far from it)

We could, but it would be a one off bespoke pattern that affects performance. I am very much not a fan of those.

Adrian (Jun 29 2024 at 17:18):

Sorry but the transformation [const, const, ...] -> ListConst doesn't seem like a one off to me. In general I would prefer to avoid allocations if possible so transforming List.single a -> a seeems implementable (perhaps i am mistken about the innerworkings of the compiler) but if lists are mostly passed as slices, than you can construct a list like wrapper around a single element. In general i would experiment with const computation, tho that might get expansive for compile times (and might not avoid as many allocations as i might envision).

Adrian (Jun 29 2024 at 17:30):

I would prefer List.splitOn to match Str.splitOn. I think of Str type like a thin wrapper around the List U8 without the direct access to elements to prevent messing up unicode codepoints, so i would like to see as many parallels between the two (it would also be easier to remember).

Anton (Jun 29 2024 at 18:02):

There is a case to be made for consistency. I would expect that most of the time the user would want to split a list by a single element. I would not prefer consistency over accommodating the common use case.

Adrian (Jun 29 2024 at 18:58):

A single element list can be made with 2 extra characters. Is it accommodating to skip 2 chars? Why do you need to create a new string if you're spliting on "/"? Sorry but i might be missing your point

Francois Green (Jun 29 2024 at 20:30):

@Anton Version 2 gives me function overloading in a manner that embraces a key feature of the language. Maybe it's too clever by half, but I find it a pleasure to read.

Brendan Hansknecht (Jun 29 2024 at 20:43):

Going from List.splitOnSeq _ [element] to List.splitOn would be a special transformation specifically for List.splitOnSeq that generates a new function that never builds the original list. This is not general.

These seem like a different problem. Sure, we have constant lists, but they still manifest in a different section of the binary with a pointer indirection. Looking at List.splitOnSeq _ [element] there is no requirement for element to be a constant. As such, this code should be expected to allocate.

Brendan Hansknecht (Jun 29 2024 at 20:45):

Yes, and it should be assumed to require an allocation. We don't have a small list optimization (and we don't have plans to add one).

Brendan Hansknecht (Jun 29 2024 at 20:47):

Because the standard library Str has restrictions related to unicode. We don't have a character* type to guarantee valid unicode input. So requiring a Str as input is what makes most sense for that api. It guarantees we don't split in a way that generates invalid unicode.

Francois Green (Jul 01 2024 at 12:47):

If List.split is to be renamed List.splitAt, can the decision to return a record and not a tuple also be reevaluated?

Anton (Jul 01 2024 at 12:58):

Adrian (Jul 01 2024 at 23:59):

Considering that the options from 1st option won't be equal for a long time and that most people voted for number 1, I see no issues blocking the implementation being added to stdlib.

Chris (Jul 03 2024 at 06:58):

If Element is abbreviated to Elem then i believe Sequence also should be abbreviated to Seq, or use full words

Kiryl Dziamura (Jul 03 2024 at 07:08):

List.splitOn list (Elem 42)
List.splitOn list (Seq [1, 2, 3])

List.splitOn list 42
List.splitOnSeq list [1, 2, 3]

Also, what’s the advantage of the tag union? How often Elem and Seq be used not directly with the List.splitOn call?

Chris (Jul 03 2024 at 07:16):

I'm not sure what is current "policy" for duplicate functions in builtins, but maybe it would be beneficial to have all those signatures?

List.splitOnElem: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)
List.splitOn : List a, [Elem a, Seq (List a)] -> List (List a)

Anton (Jul 03 2024 at 10:30):

Let's see how things go with the solution we voted on and we can revisit later if needed.

Notification Bot (Jul 03 2024 at 10:30):

Luke Boswell (Jul 18 2024 at 05:45):

Are these going to be implemented in pure Roc, or are we planning on using something like zig's std.mem.splitSequence?

Luke Boswell (Jul 18 2024 at 07:00):

Richard Feldman (Jul 18 2024 at 11:14):

what if we named it List.splitOnList instead of List.splitOnSeq? My first thought when I saw the name (after having not thought about this thread for awhile) was "so I have to give it a Seq? How do I get one of those?" :big_smile:

Isaac Van Doren (Nov 14 2024 at 03:34):

I'd like to also rename Str.split to Str.splitOn so that it is consistent with List.splitOn.

Stream: ideas

Topic: ✔ Change List.split to List.splitAt?

mdznta (Dec 08 2023 at 03:21):

Isaac Van Doren (Dec 08 2023 at 03:31):

Anton (Dec 08 2023 at 10:16):

Anton (Dec 08 2023 at 11:22):

Richard Feldman (Dec 08 2023 at 11:56):

Anton (Dec 08 2023 at 12:01):

Declan Joseph Maguire (Dec 08 2023 at 12:32):

LoipesMas (Dec 08 2023 at 12:35):

Declan Joseph Maguire (Dec 08 2023 at 12:37):

Brendan Hansknecht (Dec 08 2023 at 15:53):

Brendan Hansknecht (Dec 08 2023 at 15:54):

Brendan Hansknecht (Dec 08 2023 at 15:55):

mdznta (Dec 08 2023 at 23:05):

Sky Rose (Dec 09 2023 at 02:01):

Isaac Van Doren (Dec 09 2023 at 03:07):

Isaac Van Doren (Dec 09 2023 at 03:08):

Sky Rose (Dec 09 2023 at 16:19):

Anton (Dec 10 2023 at 12:19):

Anton (Dec 12 2023 at 11:47):

Adrian (Apr 11 2024 at 21:06):

Richard Feldman (Apr 11 2024 at 23:01):

Brendan Hansknecht (Apr 11 2024 at 23:01):

Brendan Hansknecht (Apr 11 2024 at 23:01):

Richard Feldman (Apr 11 2024 at 23:02):

Brendan Hansknecht (Apr 11 2024 at 23:38):

Kevin Gillette (Apr 12 2024 at 00:36):

Brendan Hansknecht (Apr 12 2024 at 01:10):

Isaac Van Doren (Apr 12 2024 at 01:55):

Brendan Hansknecht (Apr 12 2024 at 01:59):

Brendan Hansknecht (Apr 12 2024 at 01:59):

Brendan Hansknecht (Apr 12 2024 at 02:00):

Isaac Van Doren (Apr 12 2024 at 02:04):

Brendan Hansknecht (Apr 12 2024 at 02:15):

Brendan Hansknecht (Apr 12 2024 at 02:27):

Adrian (Apr 12 2024 at 07:35):

Anton (Jun 29 2024 at 11:58):

Anton (Jun 29 2024 at 11:58):

Francois Green (Jun 29 2024 at 14:58):

Adrian (Jun 29 2024 at 15:44):

Anton (Jun 29 2024 at 16:07):

Anton (Jun 29 2024 at 16:14):

Brendan Hansknecht (Jun 29 2024 at 17:03):

Adrian (Jun 29 2024 at 17:18):

Adrian (Jun 29 2024 at 17:30):

Anton (Jun 29 2024 at 18:02):

Adrian (Jun 29 2024 at 18:58):

Francois Green (Jun 29 2024 at 20:30):

Brendan Hansknecht (Jun 29 2024 at 20:43):

Brendan Hansknecht (Jun 29 2024 at 20:45):

Brendan Hansknecht (Jun 29 2024 at 20:47):

Francois Green (Jul 01 2024 at 12:47):

Anton (Jul 01 2024 at 12:58):

Adrian (Jul 01 2024 at 23:59):

Chris (Jul 03 2024 at 06:58):

Kiryl Dziamura (Jul 03 2024 at 07:08):

Chris (Jul 03 2024 at 07:16):

Anton (Jul 03 2024 at 10:30):

Notification Bot (Jul 03 2024 at 10:30):

Luke Boswell (Jul 18 2024 at 05:45):

Luke Boswell (Jul 18 2024 at 07:00):

Richard Feldman (Jul 18 2024 at 11:14):

Isaac Van Doren (Nov 14 2024 at 03:34):