Str: more trim functions · ideas

Kevin Gillette (Dec 23 2022 at 05:37):

It would be useful to have string trimming functions that accept an arbitrary prefix or suffix (at present, there are only functions for trimming whitespace). These could be:

Str.trimPrefix : Str, Str -> Str
Str.trimSuffix : Str, Str -> Str

Str.trimGraphemes : Str, Str -> Str
Str.trimLeftGraphemes : Str, Str -> Str
Str.trimRightGraphemes : Str, Str -> Str

Finally, it would be useful to trim based on a callback function, but it's not clear whether that function would take U8 (a "Utf8" variant) or Str (a "Grapheme" variant). Perhaps both variants could exist.

Richard Feldman (Dec 23 2022 at 11:38):

Kevin Gillette (Dec 23 2022 at 18:04):

The prefix and suffix functions are for string processing that is cheap in developer time: if you've got a fixed format with static parts, such as unstructured logging or Advent of Code inputs, you don't need to write a formal parser to process it well. Imagine something like:

time=1234 sample=5 type=temp
time=1234 sample=23 type=humidity
time=5678 sample=8 type=temp

If you only care about the temperature, you can select the lines that end in " type=temp" and trim off that same suffix, then read the time and sample by either: 1) splitting on space and trimming the prefixes "time=" and "sample=" respectively, or 2) getting the time by trimming (by callback or by graphemes) any non-digit on the left and any non-space on the right, and doing similar for the sample with the left and right operations flipped.
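Option 1 might look like the following in Go, whose `strings` package has the prefix and suffix trims being proposed. This is a sketch under assumptions: `parseTemp` is a hypothetical helper name, and the format is taken to be exactly the three-field layout shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTemp handles one line of the sample log. It reports ok=false for
// lines that are not temperature samples or that do not match the format.
func parseTemp(line string) (ts, sample int, ok bool) {
	const suffix = " type=temp"
	if !strings.HasSuffix(line, suffix) {
		return 0, 0, false
	}
	// Drop the suffix, then split what is left on whitespace.
	fields := strings.Fields(strings.TrimSuffix(line, suffix))
	if len(fields) != 2 {
		return 0, 0, false
	}
	t, err1 := strconv.Atoi(strings.TrimPrefix(fields[0], "time="))
	s, err2 := strconv.Atoi(strings.TrimPrefix(fields[1], "sample="))
	if err1 != nil || err2 != nil {
		return 0, 0, false
	}
	return t, s, true
}

func main() {
	lines := []string{
		"time=1234 sample=5 type=temp",
		"time=1234 sample=23 type=humidity",
		"time=5678 sample=8 type=temp",
	}
	for _, line := range lines {
		if ts, sample, ok := parseTemp(line); ok {
			fmt.Println(ts, sample)
		}
	}
}
```

Run against the three sample lines, this prints the time and sample for the two temperature lines and skips the humidity line.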

Aside from regex (which has well-understood readability and power issues compared to something like Rosie), these are familiar string processing techniques, especially to people coming from imperative languages. The reason for having variants with overlapping capabilities is that programmers often have their own preferred tools for solving these needs, and not having the options they're used to adds friction.

In my case, I took a 30 minute break from solving an AoC puzzle when I couldn't figure out how to parse the input using the options available short of doing tedious subsplits or converting to Utf8 or graphemes and back again (the input had something like AA flow=12; ... in it). I was also not yet prepared to figure out how to write parser combinators in Roc, as I was fairly new to the technique in any language.

I came back when I realized I could make helpers based on replaceFirst to delete flow= and ; to transform it into a space splitting problem. I hadn't considered that, because in the garbage collected language I've spent a decade working in, Go, string manipulation beyond what amounts to just slicing will allocate on the heap, and making throwaway intermediate strings is not usually a great idea. Certainly that's less of an issue for Roc's trade-offs. However, even in Roc, the proposed functions would not need to allocate new strings as the replace functions may need to do (setting aside any in-place optimizations available).

Kevin Gillette (Dec 23 2022 at 19:03):

# Parameterized so that `TrimPolicy TrimCondition` in the signatures below type-checks.
TrimPolicy a : [
    While a, # Trim as long as the condition holds.
    Until a, # Trim as long as the condition does not hold.
    Through a, # Equivalent to trimming Until (once), then While (once).
]

TrimCondition : [
    Spacing,
    Digit, # Equivalent to regexp [0-9]
    Letter, # Matches Unicode Letter class.
    Word, # Equivalent to regexp \w
    Literal Str,
    Graphemes Str,
    MatchGrapheme (Str -> Bool), # Receives one extended grapheme cluster at a time.
    MatchUtf8 (U8 -> Bool), # Receives one Utf8 byte at a time.
    MatchScalar (U32 -> Bool), # Receives one Unicode codepoint at a time.
]

trimLeft : Str, TrimPolicy TrimCondition -> Str

trimRight : Str, TrimPolicy TrimCondition -> Str

trimBoth : Str, TrimPolicy TrimCondition -> Str

# The opposite of trim: keepLeft is an inverse of trimRight,
# and could only be represented using trim functions with more TrimPolicy tags,
# such as Before and After.
# This inverse relationship is analogous to how the inverse of `<` is `>=` rather than `>`.

keepLeft : Str, TrimPolicy TrimCondition -> Str

keepRight : Str, TrimPolicy TrimCondition -> Str

# There is no keepBoth because that would require allocating a new string
# (i.e. keeping the sides but not the middle).

A note on MatchUtf8: I know there'll be reasonable concerns about producing invalid strings as a result of, say, returning true on part of an encoded sequence but not on the rest. We could simply make it a rule that if the function discards any part of an encoded sequence, then the remainder of the sequence is discarded. The meaning of "encoded sequence" would at least be "scalar" (codepoint), but could also be "extended grapheme cluster" if that's cleaner (though in the cluster case, we'd probably then want to apply the same rule to MatchScalar).

The utility of MatchUtf8 (and MatchScalar) is that the most concise way to specify, and combine, matchers for ASCII digits and letters is using a function like (\b -> '0' <= b && b <= '9'). If strings were comparable via ordering operations, i.e. "abc" < "def", then MatchUtf8 and MatchScalar would have less utility (they'd still be useful for discarding non-ASCII, though if that were a common need, we could add more TrimCondition tag(s), like Ascii or NonAscii).
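The discard rule for MatchUtf8 can be sketched in Go as follows, taking "encoded sequence" to mean a single scalar. `trimLeftBytes` and `isASCIIDigit` are hypothetical names, and the input is assumed to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCIIDigit mirrors the byte matcher (\b -> '0' <= b && b <= '9').
func isASCIIDigit(b byte) bool {
	return '0' <= b && b <= '9'
}

// trimLeftBytes trims leading bytes while match returns true. If trimming
// stops partway through a multi-byte sequence, the rest of that sequence is
// discarded too, so the result is never cut in the middle of a scalar.
func trimLeftBytes(s string, match func(byte) bool) string {
	i := 0
	for i < len(s) && match(s[i]) {
		i++
	}
	// Skip continuation bytes left over from a partially trimmed sequence.
	for i < len(s) && !utf8.RuneStart(s[i]) {
		i++
	}
	return s[i:]
}

func main() {
	fmt.Println(trimLeftBytes("12ab", isASCIIDigit)) // ab

	// "é" is encoded as 0xC3 0xA9. This matcher accepts only the first byte,
	// so the rule discards the second byte as well.
	onlyLead := func(b byte) bool { return b == 0xC3 }
	fmt.Println(trimLeftBytes("éa", onlyLead)) // a
}
```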
