Just to make sure that I'm on the right track for https://github.com/roc-lang/roc/issues/5235
The point of interest is the crates/compiler/can/src/expr.rs
, canonicalize_expr()
where it matches to ast::Expr::Str
, right?
yes
So the PlainLine-Case here:
fn flatten_str_literal<'a>(
env: &mut Env<'a>,
var_store: &mut VarStore,
scope: &mut Scope,
literal: &StrLiteral<'a>,
) -> (Expr, Output) {
use ast::StrLiteral::*;
match literal {
PlainLine(str_slice) => (Expr::Str((*str_slice).into()), Output::default()),
Line(segments) => flatten_str_lines(env, var_store, scope, &[segments]),
Block(lines) => flatten_str_lines(env, var_store, scope, lines),
}
}
needs similar Unicode handling like it happens in flatten_str_lines(...)
instead of just taking the inner string slice and turning it into an expression without checking?
looks like it
Okay, I looked into how the parser creates string literals and actually it should already recognize utf8 and create a StrLiteral::Line
(or StrLiteral::Block
) out of it. This should actually work fine and create multiple segments out of "\u(C3)\u(28)"
, but as I understand it the whole sequence isn't valid utf-8, so it should be checked as a whole? As a side note, the repl outputs Ã( for the sequence, which could be valid?
that is also the output that python gives for those octets
probably best to try this with rust, which does actual validation
yes, interesting
rustc -o main main.rs
error: incorrect unicode escape sequence
--> main.rs:2:15
|
2 | println!("\u(C3)\u(28)!");
| ^^^ incorrect unicode escape sequence
|
= help: format of unicode escape sequences is `\u{...}`
error: incorrect unicode escape sequence
--> main.rs:2:21
|
2 | println!("\u(C3)\u(28)!");
| ^^^ incorrect unicode escape sequence
|
= help: format of unicode escape sequences is `\u{...}`
error: aborting due to 2 previous errors
exit status 1
thisis a rust syntax error
Wait, no, I just used the wrong syntax :sweat_smile:
rustc -o main main.rs
./main
Ã(!
So it parses correctly actually. I don't think this is a bug in roc :thinking:
@Richard Feldman what is happening here?
I think I know what he means, technically C328 is not a correct utf-8 encoded sequence
although \u{c328} returns a value :sweat_smile:
yeah I'm surprised by Rust's behavior here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=c2a6da700293f9b6b96b7940cd597ff5 :face_with_raised_eyebrow:
if you ask it what it thinks of those two bytes encoded as UTF-8, it gives an error
but you can put them in a string literal and it's fine with it!
hmm, ok I think I see what's happening
it's treating it as a 32-bit scalar value
and the byte 0xC3u8
is not valid, but 0xC3u32
is a valid Unicode scalar value - and in string escapes, it's padded to u32
automatically
ok, so I think the conclusion is that it looks wrong but it's actually correct :laughing:
Okay, so Issue should be closed? :eyes:
my mistake!
yeah, with a link to this discussion probably https://roc.zulipchat.com/#narrow/stream/316715-contributing/topic/canonicalization/near/345927706
all good, helped me set up the project and jump into the compiler :nerd:
awesome! :smiley:
if you're interested in getting into another canonicalization bug, here's one: https://github.com/roc-lang/roc/issues/3296
I don't know what abilities are yet and I'm not sure how much time I'll get the next week. If it's not urgent and the fix doesn't involve understanding the whole project, I'm interested :smiley:
it's definitely not urgent, and shouldn't require much understanding!
happy to explain more about abilities and answer questions about the project once you're ready to get started :smiley:
btw: I don't think I can close this: https://github.com/roc-lang/roc/issues/5235 :')
closed it!
Last updated: Jul 06 2025 at 12:14 UTC