Non-printable characters in `Str` · ideas

Stream: ideas

Topic: Non-printable characters in `Str`

Luke Boswell (Jul 24 2024 at 21:40):

Roc currently happily lets us include non-printable Unicode characters in a Roc Str. This can be confusing, as in the example below, where I had copied data from a CSV file and hadn't thought much deeper about it.

Luke Boswell said:

Here's an even smaller repro

module []

fromStr : Str -> _
fromStr = \raw ->
    if raw == "FOO" then FOO
    else if raw == "BAR" then BAR
    else if raw == "BAZ" then BAZ
    else OTHER

expect
    actual = ["FOO", "BAR","BAZ"] |> List.map fromStr
    expected = [FOO, BAR, BAZ]
    actual == expected

actual = [OTHER, BAR, BAZ]
expected = [FOO, BAR, BAZ]

See also this issue https://github.com/roc-lang/roc/issues/6919

Looking for ideas on how we can keep others from suffering a similar fate. Or maybe this isn't something we should be concerned about? I definitely scratched my head for a few hours and didn't figure it out on my own. Big thanks to @Basile Henry.
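To make the failure mode concrete, here is a minimal Rust sketch of a string that looks like "FOO" but compares unequal to it, plus a scan that reports where the hidden character sits. The choice of U+FEFF is an assumption (the thread does not say which character was pasted), and `is_invisible` and `find_invisible` are hypothetical helpers with a deliberately small, non-exhaustive character set.

```rust
/// Hypothetical helper: a small, non-exhaustive set of code points that
/// usually render as nothing. A real tool would use Unicode category data.
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'                 // soft hyphen
        | '\u{200B}'..='\u{200F}'  // zero-width space and joiners, LRM, RLM
        | '\u{202A}'..='\u{202E}'  // bidi embeddings and overrides
        | '\u{2060}'..='\u{2064}'  // word joiner, invisible operators
        | '\u{2066}'..='\u{2069}'  // bidi isolates
        | '\u{FEFF}'               // byte order mark
    ) || (c.is_control() && c != '\n' && c != '\t')
}

/// Returns (byte offset, code point) for every invisible character in `s`.
fn find_invisible(s: &str) -> Vec<(usize, u32)> {
    s.char_indices()
        .filter(|&(_, c)| is_invisible(c))
        .map(|(i, c)| (i, c as u32))
        .collect()
}

fn main() {
    // Assumption: the stray character is U+FEFF; the thread does not name it.
    let pasted = "\u{FEFF}FOO";
    println!("equal to \"FOO\"? {}", pasted == "FOO");
    for (offset, cp) in find_invisible(pasted) {
        println!("invisible U+{:04X} at byte {}", cp, offset);
    }
}
```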

Joshua Warner (Jul 27 2024 at 21:38):

Ah - so the issue is that one or more of those strings has some invisible-ish characters in it?

I wonder if it would be enough to:

make the formatter write those out as unicode literals
maybe add a warning about it when parsing or canonicalizing
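A rough sketch of the formatter idea, in Rust: walk the body of a string literal and replace each invisible character with an escape sequence. The `\u(XXXX)` spelling, the `escape_invisible` name and the character set in `is_invisible` are all assumptions for illustration, not the actual formatter code.

```rust
/// Hypothetical helper: a small, non-exhaustive set of invisible code points.
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'
        | '\u{200B}'..='\u{200F}'
        | '\u{202A}'..='\u{202E}'
        | '\u{2060}'..='\u{2064}'
        | '\u{2066}'..='\u{2069}'
        | '\u{FEFF}'
    ) || (c.is_control() && c != '\n' && c != '\t')
}

/// Rewrites a literal body so that every invisible character becomes visible.
/// The `\u(XXXX)` escape spelling is assumed here for illustration.
fn escape_invisible(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    for c in body.chars() {
        if is_invisible(c) {
            // Emit the code point as uppercase hex, padded to four digits.
            out.push_str(&format!("\\u({:04X})", c as u32));
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    // A literal body with a hidden U+FEFF in front of the visible text.
    let body = "\u{FEFF}FOO";
    println!("{}", escape_invisible(body));
}
```

Because the rewrite only touches characters in the invisible set, ordinary literals pass through unchanged, which keeps the formatter output stable for the common case.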

Brendan Hansknecht (Jul 27 2024 at 22:11):

That does sound like a great simple solution.

And expect should use Inspect to print. And for strings we should have it escape non-printable characters.

Luke Boswell (Jul 28 2024 at 00:07):

I'm going to make a couple of issues for these

Luke Boswell (Jul 28 2024 at 00:53):

The above are Good First Issues, if anyone is interested in dipping their toes into the roc compiler.

Brendan Hansknecht (Jul 28 2024 at 02:36):

Just to keep this with the other issue:
Modernize expect #6930

Brendan Hansknecht (Jul 28 2024 at 02:37):

This is the work required to get expect to use Inspect.
This is essentially a repeat of the work that was already done for debug, but a bit more complex.

Brendan Hansknecht (Jul 28 2024 at 02:37):

That said, the dbg PRs should show a lot of what needs to be done, which should be helpful.

Richard Feldman (Jul 28 2024 at 12:27):

we should do the same for repl value output too!

Basile Henry (Jul 28 2024 at 12:46):

Warn users if a Str literal contains invisible unicode characters

If Roc knows about the set of invisible characters, why not disallow them completely from String literals and redirect to the unicode escape sequence in the parsing error?
I don't know what the general policy on warnings is for Roc, but I typically prefer a stricter compiler and make warnings hard errors when possible.

Richard Feldman (Jul 28 2024 at 13:10):

sometimes you may actually need them to be in the string

Richard Feldman (Jul 28 2024 at 13:10):

there are reasons Unicode added them, after all :big_smile:

Richard Feldman (Jul 28 2024 at 13:11):

but they can make debug output confusing

Basile Henry (Jul 28 2024 at 13:16):

Sure, but that's what the unicode escape codes (or just escape sequences in general) are for, no? I doubt many programming languages accept any character in string literals. It would be very confusing if I was allowed to have a raw \r (not the escaped sequence) in the middle of a string literal.

Richard Feldman (Jul 28 2024 at 13:37):

I’m not sure if that’s significantly better than the “have the formatter translate them to escape sequences” design

Richard Feldman (Jul 28 2024 at 13:37):

seems like it would only be noticeable to people who aren’t using the formatter

Basile Henry (Jul 28 2024 at 13:49):

Rust doesn't allow it for reference:

$ rustc <(echo -e "fn main() { println!(\"hello\rworld\"); }") -o out
error: bare CR not allowed in string, use `\r` instead
 --> /dev/fd/63:1:28
  |
world"); }n() { println!("hello
  |                            ^ help: escape the character: `\r`

error: aborting due to 1 previous error

It seems to me that if it was allowed, you could hide some malicious functionality in your code that would be hidden on most terminals/text editors
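The strict option could look roughly like the Rust sketch below: a check over the literal body that fails on a raw carriage return and reports its position. `BareCr` and `check_literal_body` are hypothetical names for this sketch, not part of rustc or the Roc compiler.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct BareCr {
    /// Byte offset of the carriage return within the literal body.
    offset: usize,
}

/// Rejects a string-literal body that contains a raw carriage return.
/// An escaped `\r` (backslash followed by 'r') is left alone.
fn check_literal_body(body: &str) -> Result<(), BareCr> {
    match body.find('\r') {
        Some(offset) => Err(BareCr { offset }),
        None => Ok(()),
    }
}

fn main() {
    // The literal body as it would appear in the source file, raw CR included.
    let raw = "hello\rworld";
    match check_literal_body(raw) {
        Ok(()) => println!("literal accepted"),
        Err(e) => println!("bare CR at byte {}: write \\r instead", e.offset),
    }
}
```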

Basile Henry (Jul 28 2024 at 13:51):

Whereas Roc does:

$ echo -e 'app [main] { pf: platform "https://github.com/roc-lang/basic-cli/releases/download/0.12.0/Lb8EgiejTUzbggO2HVVuPJFkwvvsfW6LojkLR20kTVE.tar.br" }

import pf.Stdout
import pf.Task

main =
    Stdout.line! "Hello\rWorld!"' > hello.roc
$ roc run hello.roc
World!

Richard Feldman (Jul 28 2024 at 13:52):

that's an interesting point - what do others think?

Basile Henry (Jul 28 2024 at 13:59):

Admittedly Rust wouldn't have caught @Luke Boswell's issue: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=aa369ba6dbda01936972af7ad5acec17 (my rough translation)

Basile Henry (Jul 28 2024 at 14:03):

So maybe a formatter-based solution does cover the most common accidental issues :+1:

Rasheed Starlet (Jul 28 2024 at 14:22):

Joshua Warner said:

Ah - so the issue is that one or more of those strings has some invisible-ish characters in it?

I wonder if it would be enough to:

make the formatter write those out as unicode literals

maybe add a warning about it when parsing or canonicalizing

Normalization? https://www.unicode.org/faq/normalization.html

Brendan Hansknecht (Jul 28 2024 at 16:23):

Basile Henry said:

Rust doesn't allow it for reference:
$ rustc <(echo -e "fn main() { println!(\"hello\rworld\"); }") -o out
error: bare CR not allowed in string, use `\r` instead
 --> /dev/fd/63:1:28
  |
world"); }n() { println!("hello
  |                            ^ help: escape the character: `\r`

error: aborting due to 1 previous error
It seems to me that if it was allowed, you could hide some malicious functionality in your code that would be hidden on most terminals/text editors

I have a hard time imagining this being the case.

Like, the code would have to both look normal enough that you trust and run it, and contain enough hidden Unicode characters to do something malicious. (Also, you would still see the warning during compilation; it just wouldn't be strictly blocking.) So I don't think this hard edge is necessary.

Brendan Hansknecht (Jul 28 2024 at 16:25):

Rasheed Starlet said:

Normalization? https://www.unicode.org/faq/normalization.html

I think Normalization is too expensive of a process to make it a default. Raw binary comparison has a lot of value. Users can always opt into Normalization if they need it.
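A small Rust sketch of what raw binary comparison means in practice: two spellings of the same visible text that differ byte for byte. The Rust standard library has no normalization routine, so this only shows the comparison; `same_bytes` is a hypothetical helper name.

```rust
/// Raw comparison: two strings are equal only if their UTF-8 bytes match.
fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

fn main() {
    // Two spellings that render as the same accented letter.
    let precomposed = "\u{00E9}"; // one code point
    let decomposed = "e\u{0301}"; // 'e' followed by a combining acute accent

    println!("bytes: {:?} vs {:?}", precomposed.as_bytes(), decomposed.as_bytes());
    println!("raw equality: {}", same_bytes(precomposed, decomposed));
}
```

Normalizing both sides first would make them compare equal, at the cost of an extra pass over every string, which is the trade-off described above.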

Brendan Hansknecht (Jul 28 2024 at 16:27):

Also, why does Unicode have to have so many different ways to normalize.....

Oskar Hahn (Jul 28 2024 at 17:03):

Could the error messages show the size in bytes of a string? If you see "Foo"(6 bytes) you would easily see the problem. And maybe the size could also be helpful in other cases.

I think this would also have helped to find the bug quickly: "Foo"(3 bytes) is not equal to "Foo"(6 bytes)

Richard Feldman (Jul 28 2024 at 17:11):

that might help in this case, but wouldn't help in the situation where you have the same total number of invisible characters, but they aren't in the same place
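A quick Rust sketch of that counterexample: two strings with one zero-width space each, in different positions, so the byte counts match while the strings differ. The choice of U+200B and the helper name are assumptions for illustration.

```rust
/// True when a byte-count hint would show nothing wrong even though the
/// strings are not equal.
fn same_length_but_different(a: &str, b: &str) -> bool {
    a.len() == b.len() && a != b
}

fn main() {
    // One zero-width space (U+200B) in each string, in different places.
    let a = "F\u{200B}OO";
    let b = "FO\u{200B}O";

    println!("{} bytes vs {} bytes", a.len(), b.len());
    println!("byte counts hide the difference: {}", same_length_but_different(a, b));
}
```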

Richard Feldman (Jul 28 2024 at 17:12):

also I think it would be noise almost all the time, so I'd prefer a solution that only makes the output different in the rare cases where this might actually cause a problem :big_smile:

Aurélien Geron (Aug 01 2024 at 22:32):

An example of a security risk involving invisible characters is the infamous RTLO attack, where the Right-to-Left Override Unicode character is used to make runnable code appear on the right and look like a comment. There are also Unicode attacks involving visible but confusable characters, such as fake / in paths or URLs, fake letters in URLs or variable names, and so on. There's even an official page for confusables. For example, if you define the expeϲt function (with a fake c), who will suspect any problem?

IMHO, this deserves serious consideration. Code editors alleviate some of these issues (e.g., VSCode makes RTLO characters stand out very clearly, as well as some confusable characters), but you can't always rely on them.

That said, it's of course extremely useful to be able to copy/paste text that contains invisible characters (such as field separators, non-breaking spaces, RTL tags, escape codes, and more) directly into a code file; I do it all the time. Perhaps the compiler could be very strict (including rejecting confusables in names), but roc format could replace all the problematic characters with escape codes?
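As a rough illustration of the two checks mentioned above, here is a Rust sketch that finds bidirectional control characters in source text and flags names containing anything outside ASCII. The function names are hypothetical, the fake c is assumed to be U+03F2, and the ASCII-only rule is deliberately crude: it would also reject legitimate non-ASCII names, whereas a real check would use the Unicode confusables data.

```rust
/// Bidirectional control characters: embeddings, overrides and isolates.
fn is_bidi_control(c: char) -> bool {
    matches!(c, '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}')
}

/// Hypothetical strict check: byte offsets of every bidi control in `source`.
fn bidi_controls(source: &str) -> Vec<usize> {
    source
        .char_indices()
        .filter(|&(_, c)| is_bidi_control(c))
        .map(|(i, _)| i)
        .collect()
}

/// Hypothetical strict check for names: flag anything outside ASCII.
fn has_non_ascii(name: &str) -> bool {
    !name.is_ascii()
}

fn main() {
    // Assumption: the look-alike letter is U+03F2 (Greek lunate sigma symbol).
    let fake = "expe\u{03F2}t";
    println!("same as \"expect\"? {}", fake == "expect");
    println!("flagged as non-ASCII? {}", has_non_ascii(fake));

    // A source line carrying a Right-to-Left Override (U+202E).
    let line = "x = 1 # \u{202E}comment";
    println!("bidi controls at bytes {:?}", bidi_controls(line));
}
```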

Anton (Aug 02 2024 at 10:27):

Good idea, we should restrict the characters for platform and package URL strings as well

Anton (Aug 03 2024 at 17:53):

#6966
#6962
#6963
#6964
#6965

Luke Boswell (Aug 03 2024 at 20:07):

Is that last one a duplicate of https://github.com/roc-lang/roc/issues/6927 ?

Anton (Aug 05 2024 at 08:56):

#6965 is broader: are invisible characters the only problematic ones?

Jack Dyre (Apr 20 2025 at 15:28):

#6928 had someone express interest in working on it, but has been stale for some time.

would I be able to take over the issue?

Luke Boswell (Apr 21 2025 at 10:25):

I'd say go for it... I haven't seen anything from hrishisd for some time.

Anton (Apr 22 2025 at 09:36):

It's possible that parts of PR#7730 could be used here as well.

Jack Dyre (Apr 25 2025 at 03:41):

As a complete newbie to the codebase, I am a little confused about the relationship between Rust and Zig. After perusing the Zig source code, I have a pretty good idea of how the parsing/tokenization works and how I would implement a solution to #6928, but the PR you linked touches only Rust parsing code.

How do the two different langs interact in this?

Thanks

Brendan Hansknecht (Apr 25 2025 at 03:50):

How do the two different langs interact in this?

The two languages don't interact. They are two different compilers. Though you can of course reference and port some algorithms from Rust to Zig.

One minor exception: Roc's builtins are written in Zig even for the Rust compiler.

Jack Dyre (Apr 25 2025 at 03:50):

Should I implement my PR for both compilers?

Brendan Hansknecht (Apr 25 2025 at 03:51):

Rust compiler can be considered deprecated. We will still accept bug fixes, but it isn't actively being worked on anymore.

Brendan Hansknecht (Apr 25 2025 at 03:52):

So Zig is where to implement things now.

Brendan Hansknecht (Apr 25 2025 at 03:52):

Though in some cases, that may mean delaying for a while (because the Zig compiler is mostly a parser right now and has a way to go before it will be ready).

Jack Dyre (Apr 25 2025 at 03:53):

Is #6928 as simple as adding a check in src/check/parse/tokenize.zig::tokenizeStringLikeLiteralBody and pushing a diagnostic message?

Brendan Hansknecht (Apr 25 2025 at 03:54):

Not sure about the exact function, but roughly that sounds correct.

Jack Dyre (Apr 25 2025 at 03:54):

Thanks!

Jack Dyre (Apr 25 2025 at 18:55):

I opened a draft PR (PR#7763), but I have a couple of questions:

Should it store some state while iterating through the characters, and combine multiple consecutive non-printable chars into a single diagnostic?
Is there anything else I need to do to implement a new diagnostic type other than just adding a new enum variant?
Are there any display concerns with pointing the diagnostic at solely non-printable chars? Are they escaped when displayed to the user? I wasn't able to figure out how the diagnostics get presented to the user, I just used the repro-tokenize binary to ensure that my diagnostic was received by the tokenizer.

Brendan Hansknecht (Apr 26 2025 at 01:16):

Should it store some state while iterating through the characters, and combine multiple consecutive non-printable chars into a single diagnostic?

I think a single diagnostic covering the entire string may be fine, assuming it is readable. Others may have differing opinions.
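One way the "store some state while iterating" option might look, sketched in Rust rather than Zig: consecutive non-printable characters are merged into a single byte range, so each run yields one diagnostic instead of one per character. `NonPrintableSpan`, `non_printable_spans` and the character set in `is_invisible` are hypothetical and not taken from the draft PR.

```rust
/// Hypothetical helper: a small, non-exhaustive set of invisible code points.
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'
        | '\u{200B}'..='\u{200F}'
        | '\u{202A}'..='\u{202E}'
        | '\u{2060}'..='\u{2064}'
        | '\u{2066}'..='\u{2069}'
        | '\u{FEFF}'
    ) || (c.is_control() && c != '\n' && c != '\t')
}

/// Hypothetical diagnostic: a half-open byte range [start, end) in the literal.
#[derive(Debug, PartialEq)]
struct NonPrintableSpan {
    start: usize,
    end: usize,
}

/// Collects one span per run of consecutive invisible characters.
fn non_printable_spans(body: &str) -> Vec<NonPrintableSpan> {
    let mut spans: Vec<NonPrintableSpan> = Vec::new();
    for (i, c) in body.char_indices() {
        if !is_invisible(c) {
            continue;
        }
        let end = i + c.len_utf8();
        // Extend the previous span when this character starts where it ended.
        if let Some(last) = spans.last_mut() {
            if last.end == i {
                last.end = end;
                continue;
            }
        }
        spans.push(NonPrintableSpan { start: i, end });
    }
    spans
}

fn main() {
    // Two adjacent zero-width spaces, then a separate U+FEFF later on.
    let body = "A\u{200B}\u{200B}B\u{FEFF}";
    for span in non_printable_spans(body) {
        println!("non-printable characters at bytes {}..{}", span.start, span.end);
    }
}
```

Reporting one diagnostic for the whole string, as suggested above, would amount to collapsing this list further into a single message that lists every span.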

Is there anything else I need to do to implement a new diagnostic type other than just adding a new enum variant?

For now, no. We don't have good printing of errors yet.

Are there any display concerns with pointing the diagnostic at solely non-printable chars? Are they escaped when displayed to the user?

Yeah, they don't get displayed yet

Last updated: Jul 23 2026 at 13:15 UTC
