Roc "multipart/form-data" parser · beginners

I'm having trouble, parsing multipart/form-data, since the builtin List lacks a splitSequence function.

Adrian (Apr 09 2024 at 10:07):

Is there a known parser for this format? Should I just bite the bullet and implement it with roc-parser, or am I misusing roc's Lists

Adrian (Apr 09 2024 at 10:17):

Luke Boswell (Apr 09 2024 at 10:56):

Luke Boswell (Apr 09 2024 at 10:58):

Luke Boswell (Apr 09 2024 at 10:59):

Luke Boswell (Apr 09 2024 at 11:02):

Adrian (Apr 09 2024 at 11:23):

-----------------------------306137057221634204471155786395
Content-Disposition: form-data; name="text"

abc
-----------------------------306137057221634204471155786395
Content-Disposition: form-data; name="file"; filename=""
Content-Type: application/octet-stream


-----------------------------306137057221634204471155786395--

Adrian (Apr 09 2024 at 11:25):

I'm a bit worried about the performance :face_with_spiral_eyes:.
I'll try the roc only approach and if it doesn't work out I'll probably send a pr to either roc-builtins or roc-webserver.

Adrian (Apr 09 2024 at 12:18):

listSplit : List a, List a -> List (List a) where a implements Eq
listSplit = \inp, sep ->
    if List.len sep == 0 then
        [inp]
    else
        inp
        |> List.walk
            { out: [], buf: [], consume: [] }
            \{out, buf, consume}, elem ->
                if elem == List.get sep (List.len consume) |> unwrap then
                    { consume: consume |> List.append elem, out, buf} |> tryToConsume sep
                else
                    {consume:[], out, buf: buf |> List.concat consume|> List.append elem}
        |> \{out,buf,consume} -> out|>List.append (List.concat buf consume)

tryToConsume = \{out,buf,consume},sep->
            if consume == sep then
                {out: out|> List.append buf, buf:[],consume:[]}
            else
                {out,buf,consume}

expect listSplit ("a/b/c/d"|>Str.toUtf8) ['/'] == ["a","b","c","d"]|>List.map Str.toUtf8
expect listSplit ("a/b/c/d"|>Str.toUtf8) ['/','/'] == ["a/b/c/d"]|>List.map Str.toUtf8
expect listSplit ("/a//b/c/d/"|>Str.toUtf8) ['/'] == ["","a","","b","c","d",""]|>List.map Str.toUtf8
expect listSplit ("//a////b//c//d/"|>Str.toUtf8) ['/','/'] == ["","a","","b","c","d/"]|>List.map Str.toUtf8

unwrap = \a->
       when a is
            Ok e -> e
            Err e -> crash "unwraped Err $(Inspect.toStr e)"

Adrian (Apr 09 2024 at 12:19):

Brendan Hansknecht (Apr 09 2024 at 14:53):

Adrian (Apr 09 2024 at 15:15):

Jep, that's why I was asking if I should implement it the same way (in zig) or just in roc

Adrian (Apr 09 2024 at 15:16):

Brendan Hansknecht (Apr 09 2024 at 15:37):

It should be fine to implement in Roc with good performance. All of the sub pieces needed for list are there.

I know that separately, some changes are planned for List.split that never got implemented. We wanted to make the signatures cleaner and more consistent: https://github.com/roc-lang/roc/issues/6164

That said, I think those changes would be around changing the current List.split to List.splitAt. Then implementing a new split function that is more simple to the string version, but still for single element splits. List.splitBy: List a, a -> List (List a). So no explicit plans for a 'List.splitSequences, but would still be a good general discussion to have. Maybe it should fully match Str.split` for flexibility.

Brendan Hansknecht (Apr 09 2024 at 15:40):

Anyway, for a performant version in Roc, it will be very important to use List.take and List.drop or List.sublist to extract the splits. I should be able to write something up later today. May also want to use a smart search algorithm like https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm, but that would be more complex to get write and use in roc. Linear scan is probably a fine start.

Adrian (Apr 10 2024 at 14:59):

Brendan Hansknecht (Apr 10 2024 at 15:39):

I don't think anyone is specifically working on the List.split rework. Someone claimed it in december, but it never progressed. Overall, it is open to ideas an contributors.

listSplitSeq = \list, seq ->
    if List.isEmpty seq then
        [list]
    else
        listSplitSeqHelper [] list seq 0 0 0

listSplitSeqHelper : List (List a), List a, List a, U64, U64, U64 -> List (List a) where a implements Eq & Inspect
listSplitSeqHelper = \out, list, seq, lastMatch, i, j ->
    when (List.get list (i + j), List.get seq j) is
        (Ok x, Ok y) if x == y ->
            # Still matching correct
            listSplitSeqHelper out list seq lastMatch i (j + 1)

        (Ok _, Ok _) ->
            # Failure to match
            listSplitSeqHelper out list seq lastMatch (i + 1) 0

        # (Ok _, Err OutOfBounds) ->
        (Ok _, Err _) ->
            # Managed to match the whole subseq
            subList = List.sublist list {start: lastMatch, len: i - lastMatch}
            nextOut = List.append out subList

            listSplitSeqHelper nextOut list seq (i + j) (i + j) 0

        (Err OutOfBounds, Err OutOfBounds) ->
            # Match all the way to the end, append last match and extra empty list
            subList = List.sublist list {start: lastMatch, len: i - lastMatch}
            nextOut = List.append out subList

            List.append nextOut []

        (Err OutOfBounds, _) ->
            # No more possible matches, append remaining
            subList = List.sublist list {start: lastMatch, len: i - lastMatch + 1}
            List.append out subList

It is just a linear scan, which is best for short sequences to match. To make it more optimal would be to add the fancy algorithm if the key is larger or the input list is really large. This might be worth making a builtin only for all of the extra bounds checks we could remove (especially if we add in the fancier algorithm).

Adrian (Apr 10 2024 at 16:30):

This is great! One minor question is it possible for anything to match (Ok _, Ok _) and not match (Ok x, Ok y). To me this seems like the same case?

Brendan Hansknecht (Apr 10 2024 at 17:25):

Brendan Hansknecht (Apr 10 2024 at 17:26):

Oh, one extra note List (List a) and List Str may have some general perf issues currently. I am actively working on a fix, but we have a refcounting issue with nested container types.

Adrian (Apr 10 2024 at 17:43):

Perhaps the quicket fix is to just send a patch to basic-webserver platform and implement the parser directly. Is there any other way to run native code other than platoform and builtins?

Brendan Hansknecht (Apr 10 2024 at 18:48):

nope (well, depends on what you mean by "native code". Both builtins and regular roc code compile the same way. Builtins just have minor abilities to cut corners. Platform is the only place where side effects can happen at all)

Brendan Hansknecht (Apr 10 2024 at 19:10):

Eh, I think it would be best to keep it in roc or the standard library. If this does lead to measurable perf issues, that just more pressure to improve roc.

Adrian (Apr 10 2024 at 20:28):

Brendan Hansknecht (Apr 11 2024 at 01:02):

Adrian (Apr 11 2024 at 07:17):

Oh, sorry; It was meant as an example of the prevous parser also being just native roc.

Stream: beginners

Topic: Roc "multipart/form-data" parser

Adrian (Apr 09 2024 at 10:06):

Adrian (Apr 09 2024 at 10:07):

Adrian (Apr 09 2024 at 10:17):

Luke Boswell (Apr 09 2024 at 10:56):

Luke Boswell (Apr 09 2024 at 10:58):

Luke Boswell (Apr 09 2024 at 10:59):

Luke Boswell (Apr 09 2024 at 11:02):

Adrian (Apr 09 2024 at 11:23):

Adrian (Apr 09 2024 at 11:25):

Adrian (Apr 09 2024 at 12:18):

Adrian (Apr 09 2024 at 12:19):

Brendan Hansknecht (Apr 09 2024 at 14:53):

Adrian (Apr 09 2024 at 15:15):

Adrian (Apr 09 2024 at 15:16):

Brendan Hansknecht (Apr 09 2024 at 15:37):

Brendan Hansknecht (Apr 09 2024 at 15:40):

Adrian (Apr 10 2024 at 14:59):

Brendan Hansknecht (Apr 10 2024 at 15:39):

Adrian (Apr 10 2024 at 16:30):

Brendan Hansknecht (Apr 10 2024 at 17:25):

Brendan Hansknecht (Apr 10 2024 at 17:26):

Adrian (Apr 10 2024 at 17:43):

Brendan Hansknecht (Apr 10 2024 at 18:48):

Brendan Hansknecht (Apr 10 2024 at 19:10):

Adrian (Apr 10 2024 at 20:28):

Brendan Hansknecht (Apr 11 2024 at 01:02):

Adrian (Apr 11 2024 at 07:17):

Adrian (Apr 11 2024 at 07:17):