Hey all, there is an issue #6164 that brings up that Str.split and List.split behave somewhat differently and proposes changing List.split to possibly List.splitAt and then introducing a List.split that more closely resembles Str.split. Does anyone have any strong opinions when it comes to this? I personally like the idea of changing List.split.
I am very in favor!
I like the rename for the current List.split to List.splitAt.
For the new List.split implementation, I prefer the List.split : List elem, elem -> List (List elem) signature.
After reading the topic in contributing I also prefer splitBy instead of split.
splitBy or splitOn? :thinking:
By is more common in other languages I think, but On does seem nicer.
A little bit of ambiguity in the name for me, is whether the split happens before or after the matching spot. I always find myself consulting the docs for things like this if I've been away from a language for a while.
Maybe this could also use tags, e.g., List.splitAt (After 5). But this doesn't "sound" that nice
Maybe generic List.split could work with that sort of structure, with more specific function names likeList.splitAfter or List.splitBy not having the tag argument as normal
SplitBy suggest removal of the element. I don't think SplitOn suggests removal
If I "split by" whitespace, I wouldn't expect to see whitespace in the output. If I "split on" whitespace, I would expect it to be captured in a trailing manner with each split.
That is definitely how I parse the wording and I think relatively common
I am in splitOn camp as well I think.
splitBy, splitBefore, and splitAfter?
My preference is List.splitOn
Whichever way it goes, it would be great if the comparable Str and List functions have the same name
I just thought of a new complication:
For each split function, we probably want a version that takes an elem and a version that takes a predicate.
List elem, elem -> List (List elem)
List elem, (elem -> Bool) -> List (List elem)
we probably want a version that takes an elem and a version that takes a predicate.
That seems reasonable at first, but I'm struggling to come up with common enough realistic use cases :thinking:
@Richard Feldman can you make the final decision on splitBy vs splitOn? That way @mdznta can move forward.
I'd like to add a first attempt at parsing multipart/form-data and since basic-webserver would probably prefer not to have List splitting code in the Http module it would be incredibly convenient to move forward with this issue.
If it counts towards anything my vote is on
List.splitAt: List a, U64 -> {before: List a, after: List a}
List.splitOn: List a, List a -> List (List a) where a implements Eq
oh yeah I just realized I never responded to this! I'm in favor of that pairing as well :+1:
So an important distinction here in @Adrian's version is that List.splitOn matches Str.split. It is skipping entire sequences of elements instead of just a single element.
Which I think is fine, but just an important difference from what was talked about above.
oh hm I missed that
We could always be explicit with it:
List.splitAt: List a, U64 -> {before: List a, after: List a}
List.splitOn: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)
Is it worth using tags instead of different function variants for this one?
That works to. I think this is a decent proposal:
List.splitAt : List a, U64 -> {before: List a, after: List a}
List.splitOn : List a, [Elem a, Sequence (List a)] -> List (List a)
I would prefer just passing a single element list if you only want to split on an element. It seems nicer than requiring a tag or having two different functions
I personally don't like that cause I really dislike the chains of minor paper cuts that aren't likely to optimize away in api design. Generating a list for a single element then extracting the element to run the more optimized single element split function will not optimize away. There will always be an unnecessary allocation.
The tag, should optimize away when the outer function inlines (which should essentially always happen). Even if it doesn't optimize away, no allocation, just an extra int in a register/on stack.
With strings it matters a lot less cause we have small string optimization (which at a minimum avoids the allocation and likely also optimizes away). We have no plans to add a small list optimization (it is a much more complex mess due to various sized elements and often requires manual tuning to be beneficial).
Gotcha, that makes sense!
Anyway, doesn't matter to me if it is a tag or 2 functions, but I think it is 100% fine for someone to start implementing this. It would be a very minor tweak to switch from one version to the other. So they can write up 99% of the PR rn.
Oh, also, I just had a realization. The fancy algorithm that gets applied with strings can't be applied on lists. You would have to explicitly know the list is a List U8 to use it. As such, I think the linear scan algorithm will be the fastest thing we can safely do.
So I think the impl I wrote here for splitting based on a seq is essentially a valid impl. That said, if written in the standard library, we can rewrite it in a smarter way to avoid more bounds checking. So that will lead to a different design if we want to write the max perf version. I think this should be performant as 100% roc builtin code (bar some future refcounting fixes, but those will come with time and don't need to be worried about now).
Should I also add the appropriate First and Last functions?
This is from my current Utils.roc so names are still subject to change.
listSplitFirstSeq : List a, List a -> { before : List a, after : List a } where a implements Eq
listSplitFirstSeq = \list, seq ->
if List.isEmpty seq then
{ before: [], after: list }
else
listSplitFirstSeqHelper list seq 0 0
listSplitFirstSeqHelper = \list, seq, i, j ->
when (List.get list (i + j), List.get seq j) is
(Ok x, Ok y) if x == y ->
listSplitFirstSeqHelper list seq i (j + 1)
(Ok _, Ok _) ->
listSplitFirstSeqHelper list seq (i + 1) 0
(Ok _, Err _) ->
{
before: List.sublist list { start: 0, len: i },
after: List.sublist list { start: i + j, len: (List.len list) - i - j },
}
(Err _, Ok _) ->
{ before: List.sublist list { start: 0, len: i + j }, after: [] }
(Err _, Err _) ->
{ before: List.sublist list { start: 0, len: i }, after: [] }
expect listSplitFirstSeq ("a/b/c" |> Str.toUtf8) ['/'] |> eq { before: "a" |> Str.toUtf8, after: "b/c" |> Str.toUtf8 }
expect listSplitFirstSeq ("/a/b/c" |> Str.toUtf8) ['/'] |> eq { before: [], after: "a/b/c" |> Str.toUtf8 }
expect listSplitFirstSeq ("abc/" |> Str.toUtf8) ['/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("a//b//c" |> Str.toUtf8) ['/', '/'] |> eq { before: "a" |> Str.toUtf8, after: "b//c" |> Str.toUtf8 }
expect listSplitFirstSeq ("//a//b//c" |> Str.toUtf8) ['/', '/'] |> eq { before: [], after: "a//b//c" |> Str.toUtf8 }
expect listSplitFirstSeq ("abc//" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("/abc/" |> Str.toUtf8) ['/', '/'] |> eq { before: "/abc/" |> Str.toUtf8, after: [] }
expect listSplitFirstSeq ("abc" |> Str.toUtf8) ['/', '/'] |> eq { before: "abc" |> Str.toUtf8, after: [] }
Someone commented on the github issue for this (#6164), let's finish up the decision here.
The last point of doubt is this standoff:
List.splitOn: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)
vs 2. :
List.splitOn : List a, [Elem a, Sequence (List a)] -> List (List a)
I'm heavily in favor of 1. it's just simpler. It doesn't require you to be familiar with tag unions, it's easier to type. I can't find any real benefits of 2 :p
Version 2 is far more elegant and once you abbreviate Sequence to Seq everyone will see it :laughing:
Sorry to restir this conversation, but aren't the roc's compiler and llvm able to optimize out List.splitOnSeq _ [element] to List.splitOn? Could we get away with only one function like in the Str module?
Version 2 is far more elegant ...
Interesting, can you elaborate on that @Francois Green?
but aren't the roc's compiler and llvm able to optimize out
List.splitOnSeq _ [element]toList.splitOn?
That optimization seems plausible, I think it's safe to consider the performance of both equivalent.
Could we get away with only one function like in the Str module?
We definitely could get away with it, can you share why you think it's the best approach?
llvm able to optimize out
List.splitOnSeq _ [element]
Definitely not. That list requires an allocation. LLVM basically gives up on everything once the heap is involved (not actaully true, but not far from it)
roc's compiler ... able to optimize out
List.splitOnSeq _ [element]
We could, but it would be a one off bespoke pattern that affects performance. I am very much not a fan of those.
Brendan Hansknecht said:
We could, but it would be a one off bespoke pattern that affects performance. I am very much not a fan of those.
Sorry but the transformation [const, const, ...] -> ListConst doesn't seem like a one off to me. In general I would prefer to avoid allocations if possible so transforming List.single a -> a seeems implementable (perhaps i am mistken about the innerworkings of the compiler) but if lists are mostly passed as slices, than you can construct a list like wrapper around a single element. In general i would experiment with const computation, tho that might get expansive for compile times (and might not avoid as many allocations as i might envision).
Anton said:
We definitely could get away with it, can you share why you think it's the best approach?
I would prefer List.splitOn to match Str.splitOn. I think of Str type like a thin wrapper around the List U8 without the direct access to elements to prevent messing up unicode codepoints, so i would like to see as many parallels between the two (it would also be easier to remember).
I would prefer
List.splitOnto matchStr.splitOn.
There is a case to be made for consistency. I would expect that most of the time the user would want to split a list by a single element. I would not prefer consistency over accommodating the common use case.
A single element list can be made with 2 extra characters. Is it accommodating to skip 2 chars? Why do you need to create a new string if you're spliting on "/"? Sorry but i might be missing your point
@Anton Version 2 gives me function overloading in a manner that embraces a key feature of the language. Maybe it's too clever by half, but I find it a pleasure to read.
Going from List.splitOnSeq _ [element] to List.splitOn would be a special transformation specifically for List.splitOnSeq that generates a new function that never builds the original list. This is not general.
Sorry but the transformation
[const, const, ...] -> ListConstdoesn't seem like a one off to me
These seem like a different problem. Sure, we have constant lists, but they still manifest in a different section of the binary with a pointer indirection. Looking at List.splitOnSeq _ [element] there is no requirement for element to be a constant. As such, this code should be expected to allocate.
A single element list can be made with 2 extra characters
Yes, and it should be assumed to require an allocation. We don't have a small list optimization (and we don't have plans to add one).
Why do you need to create a new string if you're spliting on "/"?
Because the standard library Str has restrictions related to unicode. We don't have a character* type to guarantee valid unicode input. So requiring a Str as input is what makes most sense for that api. It guarantees we don't split in a way that generates invalid unicode.
* warning: "character" is a very loaded word when talking about unicode.
If List.split is to be renamed List.splitAt, can the decision to return a record and not a tuple also be reevaluated?
Sure, can you start a new topic (in #ideas) for that?
Considering that the options from 1st option won't be equal for a long time and that most people voted for number 1, I see no issues blocking the implementation being added to stdlib.
Anton said:
List.splitOn : List a, [Elem a, Sequence (List a)] -> List (List a)
If Element is abbreviated to Elem then i believe Sequence also should be abbreviated to Seq, or use full words
I think it will look like this from the API perspective?
List.splitOn list (Elem 42)
List.splitOn list (Seq [1, 2, 3])
vs
List.splitOn list 42
List.splitOnSeq list [1, 2, 3]
Also, what’s the advantage of the tag union? How often Elem and Seq be used not directly with the List.splitOn call?
I'm not sure what is current "policy" for duplicate functions in builtins, but maybe it would be beneficial to have all those signatures?
List.splitOnElem: List a, a -> List (List a)
List.splitOnSeq: List a, List a -> List (List a)
List.splitOn : List a, [Elem a, Seq (List a)] -> List (List a)
Let's see how things go with the solution we voted on and we can revisit later if needed.
Anton has marked this topic as resolved.
Are these going to be implemented in pure Roc, or are we planning on using something like zig's std.mem.splitSequence?
I needed this, so I made something and shared in an Issue
https://github.com/roc-lang/roc/issues/6912
If it looks ok maybe we can add to the builtins?
what if we named it List.splitOnList instead of List.splitOnSeq? My first thought when I saw the name (after having not thought about this thread for awhile) was "so I have to give it a Seq? How do I get one of those?" :big_smile:
I'd like to also rename Str.split to Str.splitOn so that it is consistent with List.splitOn.
For context, once this PR merges, we'll have these list splitting functions:
splitAt : List elem, U64 -> { before : List elem, others : List elem }
splitOn : List a, a -> List (List a) where a implements Eq
splitOnList : List a, List a -> List (List a) where a implements Eq
Last updated: Jun 16 2026 at 16:19 UTC