Stream: ideas

Topic: pattern matching strings with rest


view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 15:48):

spin off of #ideas>Should we have Str.walkUtf8*?
With lists, it is quite common to use pattern matching with recursion to write parsers. Ex:

fn = \list, sum ->
    when list is
        ['0', .. as rest] -> fn rest sum
        ['1', .. as rest] -> fn rest (sum + 1)
        ...
        ['9', .. as rest] -> fn rest (sum + 9)
        _ -> sum

This is currently not possible to do with pattern matching on strings.
That said, we have the primitives to enable a user to write this manually via Str.startsWith and Str.splitFirst or Str.replaceFirst.
Why not enable direct pattern matching on Str with rest?

I am not sure the right syntax here, but it should enable matching the start, end, and getting the rest.
Here is a port of the same code in a random syntax:

fn = \list, sum ->
    when list is
        "0" with .. as rest -> fn rest sum
        "1" with .. as rest -> fn rest (sum + 1)
        ...
        "9" with .. as rest -> fn rest (sum + 9)
        _ -> sum

I really don't know a good syntax for this, but I think it would be a really nice add to the language if we can find a syntax.

Would enable recursive pattern matching parsing that supports arbitrary unicode and small strings without an allocation.

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 16:08):

Also would mean you can directly match "😀" with .. as rest instead of [0x01, 0xF6, 0x00, .. as rest]

view this post on Zulip Richard Feldman (Dec 28 2023 at 16:12):

one idea that sounds ergonomic (but potentially tricky to implement) would be to allow interpolation in string patterns, e.g.:

when str is
    "blah\(rest)" -> ...

view this post on Zulip Isaac Van Doren (Dec 28 2023 at 17:21):

This would be great!

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:03):

there's definitely a rabbit hole there haha

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:04):

like what if I do this:

when str is
    "foo\(bar)baz\(blah)" -> ...
    "blah\(rest)" -> ...

at that point it's partway to running a series of regular expressions on each string

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:06):

which on its own is doable (although certainly more complex than what we have today) but then what about exhaustiveness checking?

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:07):

also there are various edge cases which should probably generate warnings, e.g. here \(y) would always match empty string because \(x) would have consumed everything

when str is
    "\(x)\(y)" -> ...

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:08):

regarding exhaustiveness, "\(x)" -> on its own would be equivalent to x ->, so that's exhaustive

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:09):

but then also any of the "\(x)\(y)\(z)" -> variations would be exhaustive too (but with warnings about unused identifiers)

view this post on Zulip Richard Feldman (Dec 28 2023 at 18:10):

I guess actually exhaustiveness wouldn't be that tricky, in the sense that as soon as you have any non-interpolation content in the string literal, it's no longer exhaustive

view this post on Zulip Agus Zubiaga (Dec 28 2023 at 18:17):

It's pretty cool, but I wonder how often it'd be actually useful. I have a feeling people will start using this, realize it's not quite enough for their use case, and demand full regex support :upside_down:

view this post on Zulip Hannes Nevalainen (Dec 28 2023 at 22:08):

https://roc.zulipchat.com/#narrow/stream/231634-beginners/topic/Pattern.20matching.20on.20strings.2Fbinaries

view this post on Zulip Hannes Nevalainen (Dec 28 2023 at 22:10):

The way this works in erlang/Elixir is quite powerful and maybe be worth looking at for inspiration :)

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 22:19):

<> is the concat operator or special syntax for the pattern matching specifically

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 22:21):

If concat "one" <> rest would be equivalent to "one\(rest)" in current roc. So that would suggest using interpolation for variable binding in roc.

view this post on Zulip Ayaz Hafiz (Dec 28 2023 at 22:57):

Yeah, I would definitely save the string pattern matching to regexes or weakened form that's syntax sugar over list pattern matching

view this post on Zulip Ayaz Hafiz (Dec 28 2023 at 22:58):

https://www.egison.org/ is interesting but doing this kind of stuff is super complicated

view this post on Zulip Brendan Hansknecht (Dec 28 2023 at 22:59):

I think a good start that could be expanded later if we want is just matching the prefix, rest, and suffix

view this post on Zulip Richard Feldman (Dec 29 2023 at 02:46):

Agus Zubiaga said:

It's pretty cool, but I wonder how often it'd be actually useful.

one way it could be used: for matching URLs, it could save splitting the string on "/" first:

["users", username, "posts", postId] ->

vs

"/users/\(username)/posts/\(postId)" ->

view this post on Zulip Richard Feldman (Dec 29 2023 at 02:47):

also could facilitate breaking down URLs into pieces and matching on separate parts separately, e.g.

"/users/\(username)/\(rest)" ->
    when rest is
        "posts/\(postId)" -> ...
        "profile" -> ...

view this post on Zulip Richard Feldman (Dec 29 2023 at 03:13):

heh, the backslash-escapes for interpolation look awkward there, but if we had a different interpolation syntax e.g...

"/users/${username}/posts/${postId}" -> ...
"/users/${username}/${rest}" ->
    when rest is
        "comments/${commentId}" -> ...
        "profile" -> ...

view this post on Zulip Kevin Gillette (Dec 29 2023 at 03:19):

If we had a concatenation operator, we could do:

fn = \str, sum ->
    when str is
        "0" ++ rest -> fn rest sum
        "1" ++ rest -> fn rest (sum + 1)
        ...
        "9" ++ rest -> fn rest (sum + 9)
        _ -> sum

view this post on Zulip Brendan Hansknecht (Dec 29 2023 at 03:33):

Is there an inherent advantage of that over using interpolation? I think it boils down to the same thing.

view this post on Zulip Hannes Nevalainen (Dec 29 2023 at 06:30):

Brendan Hansknecht said:

<> is the concat operator or special syntax for the pattern matching specifically

The same operator is used for concatenation of strings (binaries). It can only be used last in a pattern match though so it is like a “rest” param.

The cool thing is that you can pattern match and bind variables on the bytes. “Extract a 2-byte integer to variable a, followed by the byte value 57 and bind the rest to the variable b”

view this post on Zulip Hannes Nevalainen (Dec 29 2023 at 09:35):

It is very convenient for parsing binary protocols and formats.

view this post on Zulip Sven van Caem (Dec 29 2023 at 18:15):

Richard Feldman said:

also could facilitate breaking down URLs into pieces and matching on separate parts separately, e.g.

"/users/\(username)/\(rest)" ->
    when rest is
        "posts/\(postId)" -> ...
        "profile" -> ...

Wouldn't this leave room for ambiguities, like:
"/users/foo/bar/baz/"
could match into
username = "foo"
rest = "bar/baz"
or
username = "foo/bar"
rest = "baz"

view this post on Zulip Sven van Caem (Dec 29 2023 at 18:17):

you could make rules to make one of these the "principle" one but that doesn't feel satisfying imo

view this post on Zulip Richard Feldman (Dec 29 2023 at 18:17):

it's a good question - I think it would have to be a greedy match (so, it would always match username = "foo")

view this post on Zulip Sven van Caem (Dec 29 2023 at 18:22):

If I'd have to pick I'd probably land on that too, but if I encountered this feature for the first time I feel like I'd have to google or test this behavior before using it-- and that's if the potential ambiguity even occurred to me

view this post on Zulip Sven van Caem (Dec 29 2023 at 18:24):

but then again I've never used Erlang/Elixr or another language that already has this so this might be the unfamiliarity talking

view this post on Zulip Kevin Gillette (Dec 31 2023 at 00:34):

i think my preference would be guarantees that converting a string to an list of bytes, codepoints, or grapheme clusters would be be especially convenient when pattern matching. That way we could use the already effective list matching syntax while remaining confident about performance.

Generally, converting a Str to List U8 should just be a refcount increment, but for other key functions, given something like...

when Str.graphemes str is
  ["a", "b", ..] -> 1
  ["a", ..] -> 2
  ["c", ..] -> 3

I would hope (in the eventual future) that the above would not actually split the whole string into graphemes, but would instead decode at most the first two graphemes.

I would hope the same would apply with pattern matching on the result of Str.toScalars as well.

view this post on Zulip Brendan Hansknecht (Dec 31 2023 at 00:36):

A Str to a List U8 only works that way for big strings

view this post on Zulip Brendan Hansknecht (Dec 31 2023 at 00:36):

As for graphemes, while theoretically doable, hard (impossible?) to make generically apply (though maybe it only needs to apply for common builtins)

view this post on Zulip Kevin Gillette (Dec 31 2023 at 00:37):

Okay, fair. For strings small enough to have special handling, the copying cost isn't too consequential, I guess

view this post on Zulip Kevin Gillette (Dec 31 2023 at 00:38):

And I'm not suggesting this be a generic thing. Literally just optimization support for just those 2-3 Str functions with pattern matching as a roadmap item.

view this post on Zulip Brendan Hansknecht (Dec 31 2023 at 00:39):

Ah, yeah. That should be doable with some grit to update the pattern matching code for variable lengths

view this post on Zulip Isaac Van Doren (Dec 31 2023 at 00:47):

That way we could use the already effective list matching syntax

Typing out a pattern as a list is significantly more annoying to me than being able to just type a string. Using lists also stops you from being able to copy and paste existing strings for use in a pattern.

view this post on Zulip Kevin Gillette (Dec 31 2023 at 01:07):

Using interpolation to pull data back out of strings seems neat but also not great to me. The CUE language has identical interpolation syntax, and is also quite focused on pattern matching, including matching based off of interpolated strings.

Because of how it works, I think CUE could get there (i.e. match on "\(a)+\(b)" and then constrain a and b in different ways, such as based on a regex). It's very complex to achieve, and it's not there yet, and it might be that they decide to not support that kind of transitive-constraint string matching at all.

Conversely, Roc is not a constraint language, and so can't rely on an accumulation of rules to give the matching fidelity some people might hope for. Given that conversations about interpretation-based formatting have been reasonably non-committal, I don't see Roc adding, e.g. richer interpolation syntax to say: match "\(x)\(y)" but x must be composed of numeric digits, such that an input of "123a" sets x to "123" and y to "a". We could use pattern match guards for very limited use cases, but they wouldn't be powerful enough to solve meaningful problems in this space, since guards can only validate after the fact, not guide the bounds of each interpolation match.

These are things that existing techniques, including list matching, can do pretty well, provided that we just make sure they're pretty cheap (i.e. not re-splitting a whole string on each recursive step).

Also, what if I want to use interpolation based on already defined variables, not to match within the string, but to construct the string to be matched as a whole?

# toy example
isAlreadyConcatenated = \str, a, b ->
  when str is
    "\(a)\(b)" -> true
    "\(b)\(a)" -> true
    _ -> false

We can always add interpolation-based matching later, but I suspect that, even later, we'll determine that it creates more of an expectation of capability then we're willing to support. i.e. "that's all it does? / that's all it can be used for?" At that point it probably isn't worth supporting unless we're willing to make this kind of string matching, specifically, a highlight feature of the language (i.e. start intentionally claiming and supporting some perl use-case territory).

view this post on Zulip Brendan Hansknecht (Dec 31 2023 at 01:09):

Hmm :thinking:. To me, I feel that I would find the simple case super convenient and nice in general. I feel like I might occasionally hit edge cases that want more power, but I would expect that to be rare enough that I can write custom functions for those cases.

view this post on Zulip Brendan Hansknecht (Dec 31 2023 at 01:10):

I do agree that the syntax is super flexible which can be quite dangerous for expectations

view this post on Zulip Brian Carroll (Jan 01 2024 at 00:55):

I just saw this article that might be relevant. Apparently OCaml has a package to implement binary pattern matching, vaguely like Erlang has. OCaml is similar enough to Roc that maybe it could translate?

https://practicalocaml.com/parsing-with-binary-string-pattern-matching/

view this post on Zulip Kilian Vounckx (Jun 28 2024 at 10:28):

Has any more thought been put into this? I think it would be a good feature to have, regardless of the exact syntax


Last updated: Jun 16 2026 at 16:19 UTC