canonicalization · contributing · Zulip Chat Archive

Just to make sure that I'm on the right track for https://github.com/roc-lang/roc/issues/5235
The point of interest is the crates/compiler/can/src/expr.rs, canonicalize_expr() where it matches to ast::Expr::Str, right?

Folkert de Vries (Mar 31 2023 at 11:02):

yes

itmuckel (Mar 31 2023 at 11:17):

So the PlainLine-Case here:

fn flatten_str_literal<'a>(
    env: &mut Env<'a>,
    var_store: &mut VarStore,
    scope: &mut Scope,
    literal: &StrLiteral<'a>,
) -> (Expr, Output) {
    use ast::StrLiteral::*;

    match literal {
        PlainLine(str_slice) => (Expr::Str((*str_slice).into()), Output::default()),
        Line(segments) => flatten_str_lines(env, var_store, scope, &[segments]),
        Block(lines) => flatten_str_lines(env, var_store, scope, lines),
    }
}

needs similar Unicode handling like it happens in flatten_str_lines(...) instead of just taking the inner string slice and turning it into an expression without checking?

Folkert de Vries (Mar 31 2023 at 11:33):

looks like it

itmuckel (Mar 31 2023 at 12:11):

Okay, I looked into how the parser creates string literals and actually it should already recognize utf8 and create a StrLiteral::Line (or StrLiteral::Block) out of it. This should actually work fine and create multiple segments out of "\u(C3)\u(28)", but as I understand it the whole sequence isn't valid utf-8, so it should be checked as a whole? As a side note, the repl outputs Ã( for the sequence, which could be valid?

Folkert de Vries (Mar 31 2023 at 12:20):

that is also the output that python gives for those octets

Folkert de Vries (Mar 31 2023 at 12:21):

probably best to try this with rust, which does actual validation

itmuckel (Mar 31 2023 at 13:13):

yes, interesting

 rustc -o main main.rs
error: incorrect unicode escape sequence
 --> main.rs:2:15
  |
2 |     println!("\u(C3)\u(28)!");
  |               ^^^ incorrect unicode escape sequence
  |
  = help: format of unicode escape sequences is `\u{...}`

error: incorrect unicode escape sequence
 --> main.rs:2:21
  |
2 |     println!("\u(C3)\u(28)!");
  |                     ^^^ incorrect unicode escape sequence
  |
  = help: format of unicode escape sequences is `\u{...}`

error: aborting due to 2 previous errors

exit status 1

Folkert de Vries (Mar 31 2023 at 13:14):

thisis a rust syntax error

itmuckel (Mar 31 2023 at 13:15):

Wait, no, I just used the wrong syntax :sweat_smile:

 rustc -o main main.rs
 ./main
Ã(!


itmuckel (Mar 31 2023 at 13:15):

So it parses correctly actually. I don't think this is a bug in roc :thinking:

Folkert de Vries (Mar 31 2023 at 13:15):

@Richard Feldman what is happening here?

itmuckel (Mar 31 2023 at 13:20):

I think I know what he means, technically C328 is not a correct utf-8 encoded sequence

itmuckel (Mar 31 2023 at 13:21):

although \u{c328} returns a value :sweat_smile:

Richard Feldman (Mar 31 2023 at 13:22):

yeah I'm surprised by Rust's behavior here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c2a6da700293f9b6b96b7940cd597ff5 :face_with_raised_eyebrow:

Richard Feldman (Mar 31 2023 at 13:22):

if you ask it what it thinks of those two bytes encoded as UTF-8, it gives an error

Richard Feldman (Mar 31 2023 at 13:23):

but you can put them in a string literal and it's fine with it!

Richard Feldman (Mar 31 2023 at 13:27):

hmm, ok I think I see what's happening

Richard Feldman (Mar 31 2023 at 13:27):

it's treating it as a 32-bit scalar value

Richard Feldman (Mar 31 2023 at 13:27):

and the byte 0xC3u8 is not valid, but 0xC3u32 is a valid Unicode scalar value - and in string escapes, it's padded to u32 automatically

Richard Feldman (Mar 31 2023 at 13:28):

ok, so I think the conclusion is that it looks wrong but it's actually correct :laughing:

itmuckel (Mar 31 2023 at 13:29):

Okay, so Issue should be closed? :eyes:

Richard Feldman (Mar 31 2023 at 13:29):

my mistake!

Richard Feldman (Mar 31 2023 at 13:29):

yeah, with a link to this discussion probably https://roc.zulipchat.com/#narrow/stream/316715-contributing/topic/canonicalization/near/345927706

itmuckel (Mar 31 2023 at 13:30):

all good, helped me set up the project and jump into the compiler :nerd:

Richard Feldman (Mar 31 2023 at 13:30):

awesome! :smiley:

Richard Feldman (Mar 31 2023 at 13:31):

if you're interested in getting into another canonicalization bug, here's one: https://github.com/roc-lang/roc/issues/3296

itmuckel (Mar 31 2023 at 14:38):

I don't know what abilities are yet and I'm not sure how much time I'll get the next week. If it's not urgent and the fix doesn't involve understanding the whole project, I'm interested :smiley:

Richard Feldman (Mar 31 2023 at 17:04):

it's definitely not urgent, and shouldn't require much understanding!

Richard Feldman (Mar 31 2023 at 17:04):

happy to explain more about abilities and answer questions about the project once you're ready to get started :smiley:

itmuckel (Mar 31 2023 at 20:48):

btw: I don't think I can close this: https://github.com/roc-lang/roc/issues/5235 :')

Ayaz Hafiz (Mar 31 2023 at 20:49):

closed it!

Last updated: Aug 17 2025 at 12:14 UTC

Stream: contributing

Topic: canonicalization

itmuckel (Mar 31 2023 at 10:59):

Folkert de Vries (Mar 31 2023 at 11:02):

itmuckel (Mar 31 2023 at 11:17):

Folkert de Vries (Mar 31 2023 at 11:33):

itmuckel (Mar 31 2023 at 12:11):

Folkert de Vries (Mar 31 2023 at 12:20):

Folkert de Vries (Mar 31 2023 at 12:21):

itmuckel (Mar 31 2023 at 13:13):

Folkert de Vries (Mar 31 2023 at 13:14):

itmuckel (Mar 31 2023 at 13:15):

itmuckel (Mar 31 2023 at 13:15):

Folkert de Vries (Mar 31 2023 at 13:15):

itmuckel (Mar 31 2023 at 13:20):

itmuckel (Mar 31 2023 at 13:21):

Richard Feldman (Mar 31 2023 at 13:22):

Richard Feldman (Mar 31 2023 at 13:22):

Richard Feldman (Mar 31 2023 at 13:23):

Richard Feldman (Mar 31 2023 at 13:27):

Richard Feldman (Mar 31 2023 at 13:27):

Richard Feldman (Mar 31 2023 at 13:27):

Richard Feldman (Mar 31 2023 at 13:28):

itmuckel (Mar 31 2023 at 13:29):

Richard Feldman (Mar 31 2023 at 13:29):

Richard Feldman (Mar 31 2023 at 13:29):

itmuckel (Mar 31 2023 at 13:30):

Richard Feldman (Mar 31 2023 at 13:30):

Richard Feldman (Mar 31 2023 at 13:31):

itmuckel (Mar 31 2023 at 14:38):

Richard Feldman (Mar 31 2023 at 17:04):

Richard Feldman (Mar 31 2023 at 17:04):

itmuckel (Mar 31 2023 at 20:48):

Ayaz Hafiz (Mar 31 2023 at 20:49):