pattern matching strings with rest · ideas

spin off of #ideas>Should we have Str.walkUtf8*?
With lists, it is quite common to use pattern matching with recursion to write parsers. Ex:

fn = \list, sum ->
    when list is
        ['0', .. as rest] -> fn rest sum
        ['1', .. as rest] -> fn rest (sum + 1)
        ...
        ['9', .. as rest] -> fn rest (sum + 9)
        _ -> sum

This is currently not possible to do with pattern matching on strings.
That said, we have the primitives to enable a user to write this manually via Str.startsWith and Str.splitFirst or Str.replaceFirst.
Why not enable direct pattern matching on Str with rest?

I am not sure the right syntax here, but it should enable matching the start, end, and getting the rest.
Here is a port of the same code in a random syntax:

fn = \list, sum ->
    when list is
        "0" with .. as rest -> fn rest sum
        "1" with .. as rest -> fn rest (sum + 1)
        ...
        "9" with .. as rest -> fn rest (sum + 9)
        _ -> sum

I really don't know a good syntax for this, but I think it would be a really nice add to the language if we can find a syntax.

Would enable recursive pattern matching parsing that supports arbitrary unicode and small strings without an allocation.

Brendan Hansknecht (Dec 28 2023 at 16:08):

Also would mean you can directly match "😀" with .. as rest instead of [0x01, 0xF6, 0x00, .. as rest]

Richard Feldman (Dec 28 2023 at 16:12):

one idea that sounds ergonomic (but potentially tricky to implement) would be to allow interpolation in string patterns, e.g.:

when str is
    "blah\(rest)" -> ...

Isaac Van Doren (Dec 28 2023 at 17:21):

Richard Feldman (Dec 28 2023 at 18:03):

Richard Feldman (Dec 28 2023 at 18:04):

when str is
    "foo\(bar)baz\(blah)" -> ...
    "blah\(rest)" -> ...

at that point it's partway to running a series of regular expressions on each string

Richard Feldman (Dec 28 2023 at 18:06):

which on its own is doable (although certainly more complex than what we have today) but then what about exhaustiveness checking?

Richard Feldman (Dec 28 2023 at 18:07):

also there are various edge cases which should probably generate warnings, e.g. here \(y) would always match empty string because \(x) would have consumed everything

when str is
    "\(x)\(y)" -> ...

Richard Feldman (Dec 28 2023 at 18:08):

regarding exhaustiveness, "\(x)" -> on its own would be equivalent to x ->, so that's exhaustive

Richard Feldman (Dec 28 2023 at 18:09):

but then also any of the "\(x)\(y)\(z)" -> variations would be exhaustive too (but with warnings about unused identifiers)

Richard Feldman (Dec 28 2023 at 18:10):

I guess actually exhaustiveness wouldn't be that tricky, in the sense that as soon as you have any non-interpolation content in the string literal, it's no longer exhaustive

Agus Zubiaga (Dec 28 2023 at 18:17):

It's pretty cool, but I wonder how often it'd be actually useful. I have a feeling people will start using this, realize it's not quite enough for their use case, and demand full regex support :upside_down:

Hannes Nevalainen (Dec 28 2023 at 22:08):

Hannes Nevalainen (Dec 28 2023 at 22:10):

The way this works in erlang/Elixir is quite powerful and maybe be worth looking at for inspiration :)

Brendan Hansknecht (Dec 28 2023 at 22:19):

<> is the concat operator or special syntax for the pattern matching specifically

Brendan Hansknecht (Dec 28 2023 at 22:21):

If concat "one" <> rest would be equivalent to "one\(rest)" in current roc. So that would suggest using interpolation for variable binding in roc.

Ayaz Hafiz (Dec 28 2023 at 22:57):

Yeah, I would definitely save the string pattern matching to regexes or weakened form that's syntax sugar over list pattern matching

Ayaz Hafiz (Dec 28 2023 at 22:58):

Brendan Hansknecht (Dec 28 2023 at 22:59):

I think a good start that could be expanded later if we want is just matching the prefix, rest, and suffix

Richard Feldman (Dec 29 2023 at 02:46):

one way it could be used: for matching URLs, it could save splitting the string on "/" first:

["users", username, "posts", postId] ->

"/users/\(username)/posts/\(postId)" ->

Richard Feldman (Dec 29 2023 at 02:47):

also could facilitate breaking down URLs into pieces and matching on separate parts separately, e.g.

"/users/\(username)/\(rest)" ->
    when rest is
        "posts/\(postId)" -> ...
        "profile" -> ...

Richard Feldman (Dec 29 2023 at 03:13):

heh, the backslash-escapes for interpolation look awkward there, but if we had a different interpolation syntax e.g...

"/users/${username}/posts/${postId}" -> ...
"/users/${username}/${rest}" ->
    when rest is
        "comments/${commentId}" -> ...
        "profile" -> ...

Kevin Gillette (Dec 29 2023 at 03:19):

fn = \str, sum ->
    when str is
        "0" ++ rest -> fn rest sum
        "1" ++ rest -> fn rest (sum + 1)
        ...
        "9" ++ rest -> fn rest (sum + 9)
        _ -> sum

Brendan Hansknecht (Dec 29 2023 at 03:33):

Is there an inherent advantage of that over using interpolation? I think it boils down to the same thing.

Hannes Nevalainen (Dec 29 2023 at 06:30):

The same operator is used for concatenation of strings (binaries). It can only be used last in a pattern match though so it is like a “rest” param.

The cool thing is that you can pattern match and bind variables on the bytes. “Extract a 2-byte integer to variable a, followed by the byte value 57 and bind the rest to the variable b”

Hannes Nevalainen (Dec 29 2023 at 09:35):

Sven van Caem (Dec 29 2023 at 18:15):

Wouldn't this leave room for ambiguities, like:
"/users/foo/bar/baz/"
could match into
username = "foo"
rest = "bar/baz"
or
username = "foo/bar"
rest = "baz"

Sven van Caem (Dec 29 2023 at 18:17):

you could make rules to make one of these the "principle" one but that doesn't feel satisfying imo

Richard Feldman (Dec 29 2023 at 18:17):

it's a good question - I think it would have to be a greedy match (so, it would always match username = "foo")

Sven van Caem (Dec 29 2023 at 18:22):

If I'd have to pick I'd probably land on that too, but if I encountered this feature for the first time I feel like I'd have to google or test this behavior before using it-- and that's if the potential ambiguity even occurred to me

Sven van Caem (Dec 29 2023 at 18:24):

but then again I've never used Erlang/Elixr or another language that already has this so this might be the unfamiliarity talking

Kevin Gillette (Dec 31 2023 at 00:34):

i think my preference would be guarantees that converting a string to an list of bytes, codepoints, or grapheme clusters would be be especially convenient when pattern matching. That way we could use the already effective list matching syntax while remaining confident about performance.

Generally, converting a Str to List U8 should just be a refcount increment, but for other key functions, given something like...

when Str.graphemes str is
  ["a", "b", ..] -> 1
  ["a", ..] -> 2
  ["c", ..] -> 3

I would hope (in the eventual future) that the above would not actually split the whole string into graphemes, but would instead decode at most the first two graphemes.

I would hope the same would apply with pattern matching on the result of Str.toScalars as well.

Brendan Hansknecht (Dec 31 2023 at 00:36):

As for graphemes, while theoretically doable, hard (impossible?) to make generically apply (though maybe it only needs to apply for common builtins)

Kevin Gillette (Dec 31 2023 at 00:37):

Okay, fair. For strings small enough to have special handling, the copying cost isn't too consequential, I guess

Kevin Gillette (Dec 31 2023 at 00:38):

And I'm not suggesting this be a generic thing. Literally just optimization support for just those 2-3 Str functions with pattern matching as a roadmap item.

Brendan Hansknecht (Dec 31 2023 at 00:39):

Ah, yeah. That should be doable with some grit to update the pattern matching code for variable lengths

Isaac Van Doren (Dec 31 2023 at 00:47):

Typing out a pattern as a list is significantly more annoying to me than being able to just type a string. Using lists also stops you from being able to copy and paste existing strings for use in a pattern.

Kevin Gillette (Dec 31 2023 at 01:07):

Using interpolation to pull data back out of strings seems neat but also not great to me. The CUE language has identical interpolation syntax, and is also quite focused on pattern matching, including matching based off of interpolated strings.

Because of how it works, I think CUE could get there (i.e. match on "\(a)+\(b)" and then constrain a and b in different ways, such as based on a regex). It's very complex to achieve, and it's not there yet, and it might be that they decide to not support that kind of transitive-constraint string matching at all.

Conversely, Roc is not a constraint language, and so can't rely on an accumulation of rules to give the matching fidelity some people might hope for. Given that conversations about interpretation-based formatting have been reasonably non-committal, I don't see Roc adding, e.g. richer interpolation syntax to say: match "\(x)\(y)" but x must be composed of numeric digits, such that an input of "123a" sets x to "123" and y to "a". We could use pattern match guards for very limited use cases, but they wouldn't be powerful enough to solve meaningful problems in this space, since guards can only validate after the fact, not guide the bounds of each interpolation match.

These are things that existing techniques, including list matching, can do pretty well, provided that we just make sure they're pretty cheap (i.e. not re-splitting a whole string on each recursive step).

Also, what if I want to use interpolation based on already defined variables, not to match within the string, but to construct the string to be matched as a whole?

# toy example
isAlreadyConcatenated = \str, a, b ->
  when str is
    "\(a)\(b)" -> true
    "\(b)\(a)" -> true
    _ -> false

We can always add interpolation-based matching later, but I suspect that, even later, we'll determine that it creates more of an expectation of capability then we're willing to support. i.e. "that's all it does? / that's all it can be used for?" At that point it probably isn't worth supporting unless we're willing to make this kind of string matching, specifically, a highlight feature of the language (i.e. start intentionally claiming and supporting some perl use-case territory).

Brendan Hansknecht (Dec 31 2023 at 01:09):

Hmm :thinking:. To me, I feel that I would find the simple case super convenient and nice in general. I feel like I might occasionally hit edge cases that want more power, but I would expect that to be rare enough that I can write custom functions for those cases.

Brendan Hansknecht (Dec 31 2023 at 01:10):

I do agree that the syntax is super flexible which can be quite dangerous for expectations

Brian Carroll (Jan 01 2024 at 00:55):

I just saw this article that might be relevant. Apparently OCaml has a package to implement binary pattern matching, vaguely like Erlang has. OCaml is similar enough to Roc that maybe it could translate?

Kilian Vounckx (Jun 28 2024 at 10:28):

Has any more thought been put into this? I think it would be a good feature to have, regardless of the exact syntax

Stream: ideas

Topic: pattern matching strings with rest

Brendan Hansknecht (Dec 28 2023 at 15:48):

Brendan Hansknecht (Dec 28 2023 at 16:08):

Richard Feldman (Dec 28 2023 at 16:12):

Isaac Van Doren (Dec 28 2023 at 17:21):

Richard Feldman (Dec 28 2023 at 18:03):

Richard Feldman (Dec 28 2023 at 18:04):

Richard Feldman (Dec 28 2023 at 18:06):

Richard Feldman (Dec 28 2023 at 18:07):

Richard Feldman (Dec 28 2023 at 18:08):

Richard Feldman (Dec 28 2023 at 18:09):

Richard Feldman (Dec 28 2023 at 18:10):

Agus Zubiaga (Dec 28 2023 at 18:17):

Hannes Nevalainen (Dec 28 2023 at 22:08):

Hannes Nevalainen (Dec 28 2023 at 22:10):

Brendan Hansknecht (Dec 28 2023 at 22:19):

Brendan Hansknecht (Dec 28 2023 at 22:21):

Ayaz Hafiz (Dec 28 2023 at 22:57):

Ayaz Hafiz (Dec 28 2023 at 22:58):

Brendan Hansknecht (Dec 28 2023 at 22:59):

Richard Feldman (Dec 29 2023 at 02:46):

Richard Feldman (Dec 29 2023 at 02:47):

Richard Feldman (Dec 29 2023 at 03:13):

Kevin Gillette (Dec 29 2023 at 03:19):

Brendan Hansknecht (Dec 29 2023 at 03:33):

Hannes Nevalainen (Dec 29 2023 at 06:30):

Hannes Nevalainen (Dec 29 2023 at 09:35):

Sven van Caem (Dec 29 2023 at 18:15):

Sven van Caem (Dec 29 2023 at 18:17):

Richard Feldman (Dec 29 2023 at 18:17):

Sven van Caem (Dec 29 2023 at 18:22):

Sven van Caem (Dec 29 2023 at 18:24):

Kevin Gillette (Dec 31 2023 at 00:34):

Brendan Hansknecht (Dec 31 2023 at 00:36):

Brendan Hansknecht (Dec 31 2023 at 00:36):

Kevin Gillette (Dec 31 2023 at 00:37):

Kevin Gillette (Dec 31 2023 at 00:38):

Brendan Hansknecht (Dec 31 2023 at 00:39):

Isaac Van Doren (Dec 31 2023 at 00:47):

Kevin Gillette (Dec 31 2023 at 01:07):

Brendan Hansknecht (Dec 31 2023 at 01:09):

Brendan Hansknecht (Dec 31 2023 at 01:10):

Brian Carroll (Jan 01 2024 at 00:55):

Kilian Vounckx (Jun 28 2024 at 10:28):