Roc currently happily lets us include non-printable Unicode characters in a Roc Str. This can be confusing, as in the example below, where I had copied data from a CSV file and hadn't thought much more about it.
Luke Boswell said:
Here's an even smaller repro
module []

fromStr : Str -> _
fromStr = \raw ->
    if raw == "FOO" then
        FOO
    else if raw == "BAR" then
        BAR
    else if raw == "BAZ" then
        BAZ
    else
        OTHER

expect
    actual = ["FOO", "BAR","BAZ"] |> List.map fromStr
    expected = [FOO, BAR, BAZ]
    actual == expected

actual = [OTHER, BAR, BAZ]
expected = [FOO, BAR, BAZ]
See also this issue https://github.com/roc-lang/roc/issues/6919
Looking for ideas on how we can help others avoid a similar fate. Or maybe this isn't something we should be concerned about? I definitely scratched my head for a few hours and didn't figure it out on my own. Big thanks to @Basile Henry.
Ah - so the issue is that one or more of those strings has some invisible-ish characters in it?
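To make the failure mode concrete, here is a minimal Python sketch (not Roc) of the same comparison. It assumes the stray character is a byte order mark (U+FEFF), which some CSV exports place at the start of a file; the thread does not say which character was actually involved, and `from_str` is a hypothetical stand-in for `fromStr` above.

```python
# Hypothetical stand-in for the Roc `fromStr` above.
def from_str(raw: str) -> str:
    if raw == "FOO":
        return "FOO"
    if raw == "BAR":
        return "BAR"
    if raw == "BAZ":
        return "BAZ"
    return "OTHER"


# Assumption: the copied CSV cell started with a byte order mark (U+FEFF).
pasted = "\ufeffFOO"

result = [from_str(s) for s in [pasted, "BAR", "BAZ"]]
print(result)        # the first entry falls through to "OTHER"
print(repr(pasted))  # an escaped form makes the extra character visible
```

On screen `pasted` looks identical to "FOO", which is why an escaped representation in test output would have pointed straight at the problem.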
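I wonder if it would be enough to:
- make the formatter write those out as unicode literals
- maybe add a warning about it when parsing or canonicalizing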
That does sound like a great, simple solution.
And expect should use Inspect to print. For strings, we should have it escape non-printable characters.
I'm going to make a couple of issues for these
Format invisible unicode as a literal #6927
Warn users if a Str literal contains invisible unicode characters #6928
Escape unicode when inspecting a Str #6929
The above are Good First Issues, if anyone is interested in dipping their toes into the roc compiler.
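As a rough illustration of the escaping idea shared by these issues, here is a Python sketch rather than compiler code. The set of categories treated as "invisible" is an assumption (a real implementation might pick a different set, and would likely keep `\n` and `\t` in their short forms), `escape_invisible` is a hypothetical name, and the output assumes Roc's `\u(...)` escape form.

```python
import unicodedata

# Assumption: "invisible" means the Unicode general categories Cc (control),
# Cf (format), Zl (line separator) and Zp (paragraph separator).
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def escape_invisible(text: str) -> str:
    """Replace invisible characters with \\u(XXXX)-style escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Hex code point inside \u(...), e.g. U+FEFF becomes \u(feff).
            out.append("\\u({:x})".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


print(escape_invisible("\ufeffFOO"))  # \u(feff)FOO
print(escape_invisible("FOO"))        # FOO
```

The same function could serve both the formatter (rewriting a literal) and Inspect (printing a value), since both need a visible spelling of the same characters.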
Just to keep this with the other issue:
Modernize expect #6930
This is the work required to get expect to use Inspect.
This is essentially a repeat of the work that was already done for debug, but a bit more complex.
That said, the dbg PRs should show a lot of what needs to be done, which should be helpful.
we should do the same for repl value output too!
Warn users if a Str literal contains invisible unicode characters
If Roc knows about the set of invisible characters, why not disallow them completely from String literals and redirect to the unicode escape sequence in the parsing error?
I don't know what the general policy on warnings is for Roc, but I typically prefer a stricter compiler and make warnings hard errors when possible.
sometimes you may actually need them to be in the string
there are reasons Unicode added them, after all :big_smile:
but they can make debug output confusing
Sure, but that's what the unicode escape codes (or just escape sequences in general) are for, no? I doubt many programming languages accept any character in string literals. It would be very confusing if I was allowed to have a raw \r (not the escaped sequence) in the middle of a string literal.
I’m not sure if that’s significantly better than the “have the formatter translate them to escape sequences” design
seems like it would only be noticeable to people who aren’t using the formatter
Rust doesn't allow it for reference:
$ rustc <(echo -e "fn main() { println!(\"hello\rworld\"); }") -o out
error: bare CR not allowed in string, use `\r` instead
--> /dev/fd/63:1:28
|
world"); }n() { println!("hello
| ^ help: escape the character: `\r`
error: aborting due to 1 previous error
It seems to me that if it was allowed, you could hide some malicious functionality in your code that would be hidden on most terminals/text editors
Whereas Roc does:
$ echo -e 'app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.12.0/Lb8EgiejTUzbggO2HVVuPJFkwvvsfW6LojkLR20kTVE.tar.br" }
import pf.Stdout
import pf.Task
main =
Stdout.line! "Hello\rWorld!"' > hello.roc
$ roc run hello.roc
World!
that's an interesting point - what do others think?
Admittedly Rust wouldn't have caught @Luke Boswell 's issue: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=aa369ba6dbda01936972af7ad5acec17 (my rough translation)
So maybe a formatter based solution does cover the most common accidental issues :+1:
Joshua Warner said:
Ah - so the issue is that one or more of those strings has some invisible-ish characters in it?
I wonder if it would be enough to:
- make the formatter write those out as unicode literals
- maybe add a warning about it when parsing or canonicalizing
Normalization ? https://www.unicode.org/faq/normalization.html
Basile Henry said:
Rust doesn't allow it for reference:
$ rustc <(echo -e "fn main() { println!(\"hello\rworld\"); }") -o out
error: bare CR not allowed in string, use `\r` instead
--> /dev/fd/63:1:28
|
world"); }n() { println!("hello
| ^ help: escape the character: `\r`
error: aborting due to 1 previous error
It seems to me that if it was allowed, you could hide some malicious functionality in your code that would be hidden on most terminals/text editors
I have a hard time imagining this being the case.
The code would have to look normal enough that you trust it and run it, yet contain enough hidden Unicode characters to do something malicious. (Also, you would see the warning during compilation; it just wouldn't be strictly blocking.) So I don't think this hard edge is necessary.
Rasheed Starlet said:
Normalization ? https://www.unicode.org/faq/normalization.html
I think Normalization is too expensive of a process to make it a default. Raw binary comparison has a lot of value. Users can always opt into Normalization if they need it.
Also, why does Unicode have to have so many different ways to normalize?
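For context, normalization targets a different problem than the one that started this thread. The Python sketch below uses the standard `unicodedata` module: it reconciles composed and decomposed spellings of the same text, but leaves a byte order mark in place, so (assuming the stray character was something like U+FEFF) it would not have fixed the original comparison.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Raw comparison sees two different strings...
print(composed == decomposed)                                # False
# ...and opting into normalization makes them equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# A byte order mark has no decomposition, so normalization keeps it.
with_bom = "\ufeffFOO"
print(unicodedata.normalize("NFKC", with_bom) == "FOO")      # False
```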
Could the error messages show the size in bytes of a string? If you see "Foo"(6 bytes) you would easily see the problem. And maybe the size could also be helpful in other cases.
I think this would also have helped to find the bug quickly: "Foo"(3 bytes) is not equal "Foo"(6 bytes)
that might help in this case, but wouldn't help in the situation where you have the same total number of invisible characters, but they aren't in the same place
also I think it would be noise almost all the time, so I'd prefer a solution that only makes the output different in the rare cases where this might actually cause a problem :big_smile:
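A small Python sketch of that counterexample, using a zero-width space (U+200B) as an arbitrary illustrative choice: both strings are 6 bytes in UTF-8, so a byte count in the message would show no difference even though the comparison fails.

```python
leading = "\u200bFOO"   # zero-width space before the text
trailing = "FOO\u200b"  # zero-width space after the text

print(len(leading.encode("utf-8")), len(trailing.encode("utf-8")))  # 6 6
print(leading == trailing)                                          # False
```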
An example of a security risk involving invisible characters is the infamous RTLO attack, where the Right-to-Left Override Unicode character is used to make runnable code appear on the right and look like a comment. There are also Unicode attacks involving visible but confusable characters, such as fake / in paths or URLs, fake letters in URLs or variable names, and so on. There's even an official page for confusables. For example, if you define the expeϲt function (with a fake c), who will suspect any problem?
IMHO, this deserves serious consideration. Code editors alleviate some of these issues (e.g., VSCode makes RTLO characters stand out very clearly, as well as some confusable characters), but you can't always rely on them.
That said, it's of course extremely useful to be able to copy/paste text that contains invisible characters (such as field separators, non-breakable spaces, RTL tags, escape codes, and more) directly into a code file, I do it all the time. Perhaps the compiler could be very strict (including rejecting confusables in names), but roc format could replace all the problematic characters with escape codes?
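As a rough idea of what a strict check could report, here is a Python sketch that lists every non-ASCII character in a piece of source text together with its Unicode name. `describe_non_ascii` is a hypothetical helper, and the example assumes the look-alike "c" is U+03F2; a real check would need the official confusables data rather than a blanket non-ASCII rule.

```python
import unicodedata


def describe_non_ascii(source: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    findings = []
    for index, ch in enumerate(source):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings


# Assumption: the fake "c" is U+03F2; U+202E is the right-to-left override.
print(describe_non_ascii("expe\u03f2t"))
print(describe_non_ascii("ok \u202e hidden"))
```

Printing the character name is what makes a report like this actionable, since the glyph itself is either invisible or indistinguishable from the ASCII letter.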
Good idea, we should restrict the characters for platform and package url strings as well
Is that last one a duplicate of https://github.com/roc-lang/roc/issues/6927 ?
#6965 is broader, are invisible characters the only problematic ones?
#6928 had someone express interest in working on it, but has been stale for some time.
would I be able to take over the issue?
I'd say go for it... I haven't seen anything from hrishisd for some time.
It's possible that parts of PR#7730 could be used here as well.
As a complete newbie to the codebase, I am a little confused about the relationship between Rust and Zig. After perusing the Zig source code, I have a pretty good idea of how the parsing/tokenization works and how I would implement a solution to #6982, but the PR you linked touches only Rust parsing code.
How do the two different langs interact in this?
Thanks
How do the two different langs interact in this?
The two languages don't interact. They are two different compilers. Though you can of course reference and port some algorithms from Rust to Zig.
One minor exception: Roc's builtins are written in Zig, even for the Rust compiler.
Should I implement my PR for both compilers?
The Rust compiler can be considered deprecated. We will still accept bug fixes, but it isn't actively being worked on anymore.
So Zig is where to implement things now.
Though in some cases, that may mean delaying for a while (because the Zig compiler is mostly a parser right now and has a way to go before it will be ready).
Is #6928 as simple as adding a check in src/check/parse/tokenize.zig::tokenizeStringLikeLiteralBody and pushing a diagnostic message?
Not sure about the exact function, but roughly that sounds correct.
Thanks!
I opened a draft PR (PR#7763), but I have a couple of questions:
repro-tokenize binary to ensure that my diagnostic was received by the tokenizer.
Should it store some state while iterating through the characters, and combine multiple consecutive non-printable chars into a single diagnostic?
I think a single diagnostic for an entire string may be fine, assuming it is readable. Others may have differing opinions.
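For comparison, the "store some state and merge consecutive characters" option could look roughly like the Python sketch below. It is not the Zig tokenizer: `Diagnostic`, `is_invisible` and `scan_string_body` are hypothetical names, the category test is an assumption, and positions are character indices rather than byte offsets.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Diagnostic:
    # Hypothetical shape; not the Zig tokenizer's actual diagnostic type.
    start: int  # index of the first invisible character in the run
    end: int    # index one past the last invisible character in the run


def is_invisible(ch: str) -> bool:
    # Assumption: control (Cc) and format (Cf) characters count as invisible.
    return unicodedata.category(ch) in {"Cc", "Cf"}


def scan_string_body(body: str) -> list[Diagnostic]:
    diagnostics = []
    run_start = None  # state carried from one character to the next
    for index, ch in enumerate(body):
        if is_invisible(ch):
            if run_start is None:
                run_start = index
        elif run_start is not None:
            # A visible character ends the current run.
            diagnostics.append(Diagnostic(run_start, index))
            run_start = None
    if run_start is not None:  # the run reached the end of the string
        diagnostics.append(Diagnostic(run_start, len(body)))
    return diagnostics


print(scan_string_body("\ufeff\u200bFOO\u200b"))
```

Here the two leading characters produce one diagnostic and the trailing one produces another; collapsing everything into a single diagnostic per string, as suggested above, would just mean reporting once if the list is non-empty.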
Is there anything else I need to do to implement a new diagnostic type other than just adding a new enum variant?
For now, no. We don't have good printing of errors yet.
Are there any display concerns with pointing the diagnostic at solely non-printable chars? Are they escaped when displayed to the user?
Yeah, they don't get displayed yet
Last updated: Jun 16 2026 at 16:19 UTC