Stream: contributing

Topic: canonicalization


view this post on Zulip itmuckel (Mar 31 2023 at 10:59):

Just to make sure that I'm on the right track for https://github.com/roc-lang/roc/issues/5235
The point of interest is the crates/compiler/can/src/expr.rs, canonicalize_expr() where it matches to ast::Expr::Str, right?

view this post on Zulip Folkert de Vries (Mar 31 2023 at 11:02):

yes

view this post on Zulip itmuckel (Mar 31 2023 at 11:17):

So the PlainLine-Case here:

fn flatten_str_literal<'a>(
    env: &mut Env<'a>,
    var_store: &mut VarStore,
    scope: &mut Scope,
    literal: &StrLiteral<'a>,
) -> (Expr, Output) {
    use ast::StrLiteral::*;

    match literal {
        PlainLine(str_slice) => (Expr::Str((*str_slice).into()), Output::default()),
        Line(segments) => flatten_str_lines(env, var_store, scope, &[segments]),
        Block(lines) => flatten_str_lines(env, var_store, scope, lines),
    }
}

needs similar Unicode handling like it happens in flatten_str_lines(...) instead of just taking the inner string slice and turning it into an expression without checking?

view this post on Zulip Folkert de Vries (Mar 31 2023 at 11:33):

looks like it

view this post on Zulip itmuckel (Mar 31 2023 at 12:11):

Okay, I looked into how the parser creates string literals and actually it should already recognize utf8 and create a StrLiteral::Line (or StrLiteral::Block) out of it. This should actually work fine and create multiple segments out of "\u(C3)\u(28)", but as I understand it the whole sequence isn't valid utf-8, so it should be checked as a whole? As a side note, the repl outputs Ã( for the sequence, which could be valid?

view this post on Zulip Folkert de Vries (Mar 31 2023 at 12:20):

that is also the output that python gives for those octets

view this post on Zulip Folkert de Vries (Mar 31 2023 at 12:21):

probably best to try this with rust, which does actual validation

view this post on Zulip itmuckel (Mar 31 2023 at 13:13):

yes, interesting

 rustc -o main main.rs
error: incorrect unicode escape sequence
 --> main.rs:2:15
  |
2 |     println!("\u(C3)\u(28)!");
  |               ^^^ incorrect unicode escape sequence
  |
  = help: format of unicode escape sequences is `\u{...}`

error: incorrect unicode escape sequence
 --> main.rs:2:21
  |
2 |     println!("\u(C3)\u(28)!");
  |                     ^^^ incorrect unicode escape sequence
  |
  = help: format of unicode escape sequences is `\u{...}`

error: aborting due to 2 previous errors

exit status 1

view this post on Zulip Folkert de Vries (Mar 31 2023 at 13:14):

thisis a rust syntax error

view this post on Zulip itmuckel (Mar 31 2023 at 13:15):

Wait, no, I just used the wrong syntax :sweat_smile:

 rustc -o main main.rs
 ./main
Ã(!


view this post on Zulip itmuckel (Mar 31 2023 at 13:15):

So it parses correctly actually. I don't think this is a bug in roc :thinking:

view this post on Zulip Folkert de Vries (Mar 31 2023 at 13:15):

@Richard Feldman what is happening here?

view this post on Zulip itmuckel (Mar 31 2023 at 13:20):

I think I know what he means, technically C328 is not a correct utf-8 encoded sequence

view this post on Zulip itmuckel (Mar 31 2023 at 13:21):

although \u{c328} returns a value :sweat_smile:

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:22):

yeah I'm surprised by Rust's behavior here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c2a6da700293f9b6b96b7940cd597ff5 :face_with_raised_eyebrow:

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:22):

if you ask it what it thinks of those two bytes encoded as UTF-8, it gives an error

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:23):

but you can put them in a string literal and it's fine with it!

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:27):

hmm, ok I think I see what's happening

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:27):

it's treating it as a 32-bit scalar value

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:27):

and the byte 0xC3u8 is not valid, but 0xC3u32 is a valid Unicode scalar value - and in string escapes, it's padded to u32 automatically

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:28):

ok, so I think the conclusion is that it looks wrong but it's actually correct :laughing:

view this post on Zulip itmuckel (Mar 31 2023 at 13:29):

Okay, so Issue should be closed? :eyes:

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:29):

my mistake!

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:29):

yeah, with a link to this discussion probably https://roc.zulipchat.com/#narrow/stream/316715-contributing/topic/canonicalization/near/345927706

view this post on Zulip itmuckel (Mar 31 2023 at 13:30):

all good, helped me set up the project and jump into the compiler :nerd:

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:30):

awesome! :smiley:

view this post on Zulip Richard Feldman (Mar 31 2023 at 13:31):

if you're interested in getting into another canonicalization bug, here's one: https://github.com/roc-lang/roc/issues/3296

view this post on Zulip itmuckel (Mar 31 2023 at 14:38):

I don't know what abilities are yet and I'm not sure how much time I'll get the next week. If it's not urgent and the fix doesn't involve understanding the whole project, I'm interested :smiley:

view this post on Zulip Richard Feldman (Mar 31 2023 at 17:04):

it's definitely not urgent, and shouldn't require much understanding!

view this post on Zulip Richard Feldman (Mar 31 2023 at 17:04):

happy to explain more about abilities and answer questions about the project once you're ready to get started :smiley:

view this post on Zulip itmuckel (Mar 31 2023 at 20:48):

btw: I don't think I can close this: https://github.com/roc-lang/roc/issues/5235 :')

view this post on Zulip Ayaz Hafiz (Mar 31 2023 at 20:49):

closed it!


Last updated: Jul 06 2025 at 12:14 UTC