Unicode code point syntax · ideas

Stream: ideas

Topic: Unicode code point syntax

Richard Feldman (Dec 25 2023 at 00:42):

currently, the syntax for inserting Unicode code points into string literals is \u(1234) where 1234 is a hexadecimal number representing the code point. For example:

"caf\u(e9)" == "café"

I recently realized there are two things I don't love about this:

It's not clear that this is hexadecimal, since there's no 0x marker (or similar) like there would be in integer literals. In the case of \u(e9) you have a clue because of the e, but in the case of something like \u(123) it wouldn't be clear at all. I like the idea of making it more clear from looking at it that it's hexadecimal versus something else.
The \u(...) syntax is not the easiest to remember. Some languages use \u{e9}, but that seems wrong for Roc. { never goes around integer literals anywhere else in the language, so the only reason to choose that would be that other languages do it.
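
To make the first point concrete, here is a minimal Python sketch of how the current escape is read. It is an illustration, not Roc's actual parser: the helper name `expand_hex_escapes` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical helper: matches the current \u(HEX) escape form described above.
_ESCAPE = re.compile(r"\\u\(([0-9a-fA-F]+)\)")

def expand_hex_escapes(text: str) -> str:
    """Replace each \\u(HEX) with the character at that hexadecimal code point."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(expand_hex_escapes(r"caf\u(e9)") == "café")    # True
print(expand_hex_escapes(r"\u(123)") == chr(0x123))  # True: 123 is read as hex, not decimal
```

Nothing in `\u(123)` itself signals that the digits are hexadecimal, which is the ambiguity being described.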

Richard Feldman (Dec 25 2023 at 00:43):

putting these together, I kind of like the idea of just using the normal string interpolation syntax and having it accept number literals, e.g.

"caf\(0xe9)" == "café"
"caf\(233)" == "café"

Richard Feldman (Dec 25 2023 at 00:43):

this would be a bit weird in that, today, what goes inside the \(...) needs to have the type Str - and this would make it so that it could be either an expression of type Str or else a number literal

Richard Feldman (Dec 25 2023 at 00:45):

however, what I like about it is:

although it's weird, it takes about 500 milliseconds to learn, and would probably lead to zero bugs in practice ever
the type weirdness aside, it fits with the rest of the language in that it's a number literal going inside parentheses, which is a totally valid thing to put inside parentheses in Roc
it seems easier to remember than \u(...)

Brendan Hansknecht (Dec 25 2023 at 00:52):

I don't like it cause I think there will be many annoying moments where instead of getting a quick compiler error, a number you want to print will not be converted to a string.

Brendan Hansknecht (Dec 25 2023 at 00:52):

It will be interpreted as Unicode.

Brendan Hansknecht (Dec 25 2023 at 00:53):

I think it will just be a really common minor annoyance

Brendan Hansknecht (Dec 25 2023 at 00:55):

\(myNum)...compile....wait for execution to reach the statement...see some random Unicode character....be confused for a second and minorly annoyed...update code to be \(myNum |> Inspect.toStr)

Richard Feldman (Dec 25 2023 at 01:41):

oh I was specifically thinking of number literals for this

Richard Feldman (Dec 25 2023 at 01:42):

\(myNum) would expect that myNum has the type Str because it isn't a number literal

Richard Feldman (Dec 25 2023 at 01:43):

in general we can't safely interpolate integers into the middle of strings because they aren't necessarily Unicode scalar values

Richard Feldman (Dec 25 2023 at 01:43):

with number literals we can validate at compile time that they're valid
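
A short Python sketch of the check a compiler could run on a literal. The function name is an assumption for illustration; the ranges are the standard definition of a Unicode scalar value (any code point from 0 to 0x10FFFF except the surrogates 0xD800 to 0xDFFF).

```python
def is_unicode_scalar_value(n: int) -> bool:
    """True for 0..0x10FFFF, excluding the surrogate range 0xD800..0xDFFF."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

print(is_unicode_scalar_value(0xE9))      # True
print(is_unicode_scalar_value(0xD800))    # False: surrogate
print(is_unicode_scalar_value(0x110000))  # False: past the end of the code space
```

With a literal, this check can run once at compile time. With an arbitrary integer variable, it would have to run at runtime and could fail there.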

Brendan Hansknecht (Dec 25 2023 at 01:55):

Oh...yeah....should be fine. Though maybe minorly inconsistent or strange.

Richard Feldman (Dec 25 2023 at 02:02):

another option would be to keep the \u but have it be interpreted as a normal number literal, e.g. you'd need to do \u(0xe9)

Brendan Hansknecht (Dec 25 2023 at 02:37):

Could also go \(u0xe9) which might feel a bit more like normal interpolation, but u0x is definitely noisy looking

Richard Feldman (Dec 25 2023 at 02:41):

but u0xe9 is a valid variable name :sweat_smile:

Brendan Hansknecht (Dec 25 2023 at 02:41):

Ah... Yeah, nvm

Luke Boswell (Dec 25 2023 at 03:22):

What about something like \(U+12F0A1)? That aligns with the way codepoints are shown in the technical docs for unicode.

Elias Mulhall (Dec 25 2023 at 04:41):

Why not \u1234? That seems most consistent, considering other escape sequences like \n work how you'd expect.

Brian Carroll (Dec 25 2023 at 06:46):

I don't like it sharing the string interpolation syntax even if it's only literals. That's the kind of thing I would do when debugging. Why isn't this variable printing out? Ok I'll replace it with a literal just to make sure I know what's happening. Huh? Now it just turned into some weird character, what? Oh yeah... you can't just replace things with their values in Roc any more because of the Unicode syntax!

Brian Carroll (Dec 25 2023 at 06:46):

Much better to keep a different escape sequence

Brian Carroll (Dec 25 2023 at 06:49):

So \u123 is fine and I'm used to it from other places like JSON, but misses the potential improvement of making it clear that it's hex.

Brian Carroll (Dec 25 2023 at 06:49):

Maybe \x123

Brian Carroll (Dec 25 2023 at 06:51):

It's clearly an escape sequence, different from any other. The x tells you it's hex, and then what else would it be other than Unicode

Luke Boswell (Dec 25 2023 at 07:31):

I guess the challenge is how to know when the number ends and the rest of the string begins. With braces that's easy to parse

Brian Carroll (Dec 25 2023 at 08:25):

Fair enough, as long as it's distinct from string interpolation!

Richard Feldman (Dec 25 2023 at 21:07):

Brian Carroll said:

Why isn't this variable printing out? Ok I'll replace it with a literal just to make sure I know what's happening. Huh? Now it just turned into some weird character, what? Oh yeah... you can't just replace things with their values in Roc any more because of the Unicode syntax!

I think I should have said it explicitly, but this syntax is unambiguous and not in use today:

» "thing: \(123)"

── TYPE MISMATCH ───────────────────────────────────────────────────────────────

This argument to this string interpolation has an unexpected type:

4│      "thing: \(123)"
                  ^^^

The argument is a number of type:

    Num *

But this string interpolation needs its argument to be:

    Str

Richard Feldman (Dec 25 2023 at 21:08):

so the idea is to make this no longer a type mismatch, but rather to interpret it differently

Richard Feldman (Dec 25 2023 at 21:08):

such that \( followed by an integer literal followed by ) is treated differently by the parser

Richard Feldman (Dec 25 2023 at 21:11):

but anyway, the fact that this is surprising is probably predictive of how surprising it would be if this were the actual design :big_smile:

Oskar Hahn (Dec 25 2023 at 21:38):

Could there be a function with a short name, that converts a number to its Unicode character as a string? Then you could write

"caf\(Str.u 0xe9)"

And there would not be the need for any special syntax in the language.

Richard Feldman (Dec 25 2023 at 22:42):

unfortunately this isn't possible because not all integers are valid unicode code points :big_smile:

Kevin Gillette (Dec 26 2023 at 02:05):

I would expect more readers to interpret "\(97)" as equivalent to "97" rather than "a".

The reason I believe this is:

String interpolation of _arbitrary expressions_ is becoming increasingly common in languages.
Few seem to have many or any restrictions on the type of expression that can be interpolated.
Language builtin types are typically robustly supported where interpolation is supported.
Numeric expressions generally format as base-10 when interpolated.
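
For comparison, this is how Python, one language with expression interpolation, treats the same integer. It only illustrates the two readings being contrasted; the variable names are made up for the example.

```python
n = 97
as_decimal_text = f"{n}"  # how an interpolated integer is usually formatted
as_code_point = chr(n)    # what a code point reading of "\(97)" would produce

print(as_decimal_text)  # 97
print(as_code_point)    # a
```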

Kevin Gillette (Dec 26 2023 at 02:25):

As such, I believe it would be a mistake to have the "\(...)" syntax represent codepoints when provided as integer literals. It'd be a frequent surprise to learners, it'd be seen as a language idiosyncrasy, and I think we'd regularly field questions about this.

Some things I don't like about it in particular:

1. Especially being a PFP language, we absolutely should have uniform handling of expressions regardless of whether they're literals, variables, or operations. In other words, interpolation should have the property of referential transparency.
2. Following point 1, it's okay for type-check failures, i.e. "integers are not supported in string interpolations," but that should apply to both variables and literals equally.
3. I expect we'll capitulate at some point and just have meaningful representations of all builtin types, including numerics. When interpolated, it's non-controversial for integers to be represented in base-10, and probably non-controversial for frac values to be base-10 with a . fractional delimiter, or optionally scientific notation for large values. Alternate representations can explicitly convert to strings first (i.e. for locale formatting).

If we want to prohibit non-strings because we want to make the developer think carefully about formatting, there are various mechanisms we can use to encourage that (such as some opaque "Display" type wrapping a string), but we'll invariably get requests for some convenient way to represent arbitrary data for dumping into logs, or which otherwise would not be presented to end-users. A less restrictive "\?(123)" escape (i.e. to work with the Inspect module) could serve that purpose.

Kevin Gillette (Dec 26 2023 at 02:27):

For Unicode code points, I think "\u(123)" will work well enough and avoid the issues above, provided we can also specify "\u(someIntVar)" as well.

Kevin Gillette (Dec 26 2023 at 02:29):

(the variable approach could panic if it doesn't represent a valid codepoint, or fail a build if it's eligible for compile-time evaluation)

Brendan Hansknecht (Dec 26 2023 at 02:47):

I love the idea of using \?(expr) to automatically pipe through Inspect.toStr

Richard Feldman (Dec 26 2023 at 03:10):

Kevin Gillette said:

For Unicode code points, I think "\u(123)" will work well enough and avoid the issues above, provided we can also specify "\u(someIntVar)" as well.

\u(someIntVar) can't work because not all integers are valid code points

Richard Feldman (Dec 26 2023 at 03:11):

what about the idea of keeping \u(123) but having 123 interpreted as decimal instead of hexadecimal, like most number literals are? and then if you want hex, you do \u(0x123)

Richard Feldman (Dec 26 2023 at 03:12):

(as a transitionary step, we could try requiring 0x for a while before opening it up to allow decimal)
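
A Python sketch of what "interpreted like a normal number literal" could look like for the contents of `\u(...)`. The function name is hypothetical, and Python's base-0 parsing rules stand in for Roc's literal grammar, which may differ in detail.

```python
def code_point_from_literal(literal: str) -> int:
    """Parse the text inside \\u(...) as an ordinary integer literal."""
    value = int(literal, 0)  # base 0: "0x" prefix means hex, no prefix means decimal
    if not (0 <= value <= 0x10FFFF) or 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"{literal} is not a Unicode scalar value")
    return value

print(code_point_from_literal("233") == code_point_from_literal("0xe9"))  # True
print(code_point_from_literal("123"), code_point_from_literal("0x123"))   # 123 291
```

Under this reading, `\u(123)` and `\u(0x123)` name different characters, and the base is always visible in the literal.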

Brendan Hansknecht (Dec 26 2023 at 03:28):

I like that the best of the proposals so far. Essentially, just clarify the literal base in Unicode definitions.

Kevin Gillette (Dec 26 2023 at 04:33):

Richard Feldman said:

what about the idea of keeping \u(123) but having 123 interpreted as decimal instead of hexadecimal, like most number literals are? and then if you want hex, you do \u(0x123)

Yeah, I think that'll be an effective syntax, exactly as you describe.

Kevin Gillette (Dec 26 2023 at 04:33):

It'd be great to have some mechanism to interpolate a variable into a codepoint, even if that just ends up panicking or getting replaced by the Unicode replacement char when invalid (as if they had specified \u(0xfffd)).

We have panicking, checked, and other variants of some other operations (such as arithmetic), so it seems like there's a plausible path to doing the same here as well. Are strings special enough to require a wholly different philosophy to runtime validation than we apply to numerics?
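
The three runtime behaviors mentioned here can be sketched in Python as follows. All function names are hypothetical; none of them corresponds to an existing Roc builtin.

```python
from typing import Optional

def _is_scalar(n: int) -> bool:
    # Unicode scalar value: 0..0x10FFFF excluding surrogates.
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def char_or_panic(n: int) -> str:
    """Panicking variant: invalid input stops the program."""
    if not _is_scalar(n):
        raise ValueError(f"{n:#x} is not a Unicode scalar value")
    return chr(n)

def char_checked(n: int) -> Optional[str]:
    """Checked variant: invalid input is reported to the caller as None."""
    return chr(n) if _is_scalar(n) else None

def char_or_replacement(n: int) -> str:
    """Lenient variant: invalid input becomes U+FFFD, the replacement character."""
    return chr(n) if _is_scalar(n) else "\ufffd"
```

The trade-off is the same as with arithmetic variants: the caller chooses between a crash, an explicit failure value, or a silent substitution.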

Kevin Gillette (Dec 26 2023 at 04:37):

In any case, it sounds like a Roc programmer just needs to learn that everything that doesn't render as-is in a string starts with \, and all interpolations have something in parentheses. The interpretation/behavior of the interpolation depends on what immediately follows the \

Brian Carroll (Dec 26 2023 at 05:49):

I don't like the parentheses because that syntax strongly suggests that it should work the same as interpolation, and therefore an arbitrary expression should work. But that's misleading because we need to check the value.

Brian Carroll (Dec 26 2023 at 05:50):

\u123 and \uxab12 could work?

Brian Carroll (Dec 26 2023 at 05:53):

Or \u{123}

Brian Carroll (Dec 26 2023 at 05:54):

And \u{0x123}

Richard Feldman (Dec 26 2023 at 13:06):

the problem with \u123 is that there can be letters and/or numbers right after it in the string literal, and it's no longer possible to tell where the escape ends

Richard Feldman (Dec 26 2023 at 13:07):

other languages have done that and they all end up adding some delimited syntax because of that situation, so I think we should just have the (less error-prone) delimited syntax
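
A small Python illustration of why the undelimited form is error-prone. Both readers below are assumptions made for the example: the first reads as many hex digits as it can after `\u`, the second requires parentheses.

```python
import re

def greedy_undelimited(text: str) -> str:
    # Hypothetical reader for \uHEX: consumes every hex digit that follows.
    return re.sub(r"\\u([0-9a-fA-F]+)", lambda m: chr(int(m.group(1), 16)), text)

def delimited(text: str) -> str:
    # Hypothetical reader for \u(HEX): the closing parenthesis ends the escape.
    return re.sub(r"\\u\(([0-9a-fA-F]+)\)", lambda m: chr(int(m.group(1), 16)), text)

# Intended text: "caf", then é (U+00E9), then the letters "ed".
print(greedy_undelimited(r"caf\ue9ed") == "caf" + chr(0xE9ED))  # True: "ed" was swallowed
print(delimited(r"caf\u(e9)ed") == "caféed")                    # True
```

A fixed-width escape avoids the greedy problem but forces padding and caps the range, which is why delimited forms tend to get added later.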

Richard Feldman (Dec 26 2023 at 13:08):

a thing I don't like about \u{123} is that {123} is never valid Roc syntax, so it feels kind of arbitrary and unnecessarily hard to remember :sweat_smile:

Richard Feldman (Dec 26 2023 at 13:08):

although I appreciate the point about \u(123) looking like interpolation while not being interpolation

Richard Feldman (Dec 26 2023 at 13:09):

what about \u[123]? It looks like a list, and in fact it could even allow multiples

Richard Feldman (Dec 26 2023 at 13:09):

e.g. \u[123, 456]

Richard Feldman (Dec 26 2023 at 13:09):

although then maybe it once again looks like it should be interpolation haha

Brian Carroll (Dec 26 2023 at 13:12):

a thing I don't like about \u{123} is that {123} is never valid Roc syntax

Well we are talking about creating new syntax!
Sounds like you want to make it look like some other existing syntax whereas I was specifically trying to go in the opposite direction. Because again all other syntax allows literals and other expressions interchangeably.

Richard Feldman (Dec 26 2023 at 13:15):

yeah that's an interesting point :thinking:

Richard Feldman (Dec 26 2023 at 13:24):

a consideration about \u{123} specifically is that other languages use that exact syntax, but they interpret 123 as hexadecimal

Brian Carroll (Dec 26 2023 at 13:28):

How about \ud{123} for decimal and \ux{123} for hex with no option to leave the base unspecified?

Brian Carroll (Dec 26 2023 at 13:28):

Bit more verbose, I don't know if I like it or not, just occurred to me

Richard Feldman (Dec 26 2023 at 13:29):

or I guess could shorten that to \x{123} and \d{123}

Brian Carroll (Dec 26 2023 at 13:29):

ha ha I was just about to type the same thing

Brendan Hansknecht (Dec 26 2023 at 14:23):

Personally, I much prefer the \u(..) syntax specifically because it looks like interpolations

Brendan Hansknecht (Dec 26 2023 at 14:23):

I think keeping those similar is really nice

Brendan Hansknecht (Dec 26 2023 at 14:24):

Then it also makes adding any syntax like \?(...) just fit (for automatically applying Inspect.toStr), which I think is another win.

Brendan Hansknecht (Dec 26 2023 at 14:25):

I am not a fan of \x or \d. I don't think users will as easily remember what they mean when they randomly see them in a codebase

Richard Feldman (Dec 26 2023 at 14:57):

let's do a separate thread about the \? idea to keep this thread focused on unicode :big_smile:

Richard Feldman (Dec 26 2023 at 14:58):

so here we have competing goals - obviously we can't pick a syntax that both looks like other valid Roc syntax while also not looking like it!

Richard Feldman (Dec 26 2023 at 14:59):

I'm curious what others think about that basic question - of whether looking like existing syntax is positive, negative, neutral, etc

Brendan Hansknecht (Dec 26 2023 at 15:07):

I think the idea of \?(...) even if we never add it to the language is important for this discussion. If we add \u{...} or \x{...}, I don't think that \?(...) would fit in. On the other hand, if we do \u(...), I think it would fit much more nicely. So I kinda see having one as making the other more justifiable in the language.

It would also suck to pick \u{...} here, then in the other thread realize we don't want \?{...} cause it is too different from other interpolation syntax, but we also don't want \?(...) cause it is not like any other syntax. I don't think we need to discuss \?(...) here. I just think it (and the general idea of possibly more extensions with the same syntax) should be kept in mind as we pick this syntax.

Richard Feldman (Dec 26 2023 at 15:08):

hm maybe, but I'd still like to discuss the \? idea separately and reference it from this discussion, instead of combining both into one thread

Anton (Dec 26 2023 at 15:08):

I'm okay with \u(...), the u indicates something different is going on, that's good enough for me.

Kevin Gillette (Dec 26 2023 at 15:54):

I think \u(...) is the best option discussed so far because we've already set the tone with \(...).

The easiest thing for people to remember will be "\ and () always define an interpolation of some kind, and an optional 'modifier' character after the \ determines how the interpolation is interpreted."

For consistency, "the contents of the parentheses is always a valid Roc expression, and it's the modifier char alone that is used to determine what type checks, what expressions are allowed (the compiler will guide the programmer), and how that expression is formatted."

So far as relevant to this thread (not being discussed elsewhere):

\(...) is a display interpolation, which presently must type check to a string, but otherwise has no restrictions on expression.
\u(...) must type check to an int, and presently expressions are limited to positive untyped int literals (27i128 probably has no value being expressed). The integer is interpreted as a Unicode code point.

Some of these type and expression restrictions may be relaxed in the future, though since these are just Roc expressions, and since each formatting interpretation has a distinct modifier char, such an expansion would automatically be backwards-compatible.
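
The "modifier char decides the interpretation" rule can be sketched as a dispatch table. This is a toy in Python, not Roc's design. Its assumptions: the parenthesized body is either a bare variable name (for `\(...)`) or an integer literal (for `\u(...)`), decimal unless prefixed with 0x, and all names are hypothetical.

```python
import re

def _display(body: str, env: dict) -> str:
    value = env[body]  # toy restriction: only bare variable names are supported
    if not isinstance(value, str):
        raise TypeError(f"\\({body}) must be a string, got {type(value).__name__}")
    return value

def _code_point(body: str, env: dict) -> str:
    n = int(body, 0)  # integer literal only; decimal unless prefixed with 0x
    if not (0 <= n <= 0x10FFFF) or 0xD800 <= n <= 0xDFFF:
        raise ValueError(f"{body} is not a Unicode scalar value")
    return chr(n)

# One handler per modifier char; the empty modifier is plain interpolation.
HANDLERS = {"": _display, "u": _code_point}
_PATTERN = re.compile(r"\\([a-zA-Z?]?)\(([^()]*)\)")

def render(template: str, env: dict) -> str:
    def expand(match: re.Match) -> str:
        modifier, body = match.group(1), match.group(2).strip()
        if modifier not in HANDLERS:
            raise ValueError(f"unknown modifier {modifier!r}")
        return HANDLERS[modifier](body, env)
    return _PATTERN.sub(expand, template)

print(render(r"caf\u(0xe9) for \(name)", {"name": "Kevin"}))  # café for Kevin
```

Adding another modifier later means adding one entry to the table, which is the backwards-compatibility property described above.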

Kevin Gillette (Dec 26 2023 at 16:43):

If we ever wanted a distinct behavior/interpretation, we should introduce a new modifier char or some multi-char modifier syntax. However, we should be frugal with our choices.

For example, \x(123) was briefly proposed as "interpret 123 as a hexadecimal Unicode code point." We should always ask ourselves whether a modifier char would better represent something else. Whether or not we want to provide this capability, the answer there is easy: \x would better represent hex formatting of strings, List U8, and integers.

I propose this:

Each modifier char serves only one purpose and each purpose is served by at most one modifier char and its uppercase form. Unicode code points are served by \u, and so no other modifier char (not \x, not \d, nor anything else) may be used for variations on that same purpose. Conversely, \U is reserved for use as an alternate major mode for expressing Unicode code points.
We may never use \U or other uppercase modifiers, but if we do, we should have a good idea about what uppercase thematically means for modifiers in general before we introduce the first one, so that programmers can easily learn "the uppercase form of any modifier does this kind of thing," so that they can memorize the lowercase modifiers and a thematic rule, rather than needing to separately memorize both lowercase and uppercase modifiers. Rationale: imagine a keyboard where shift+d yielded "f".
In the design space, we operate as though we'll end up with a rich set of convenient, memorable modifier chars that each do something completely different. As such, we want to reserve the most memorable char for each potential purpose, and think carefully about whether the use we presently have in mind is worthy of the modifier char we propose for it. If we had first started discussing integer interpolation, and for whatever reason we had chosen to be pedantic and introduce \u for unsigned integers and perhaps \U for uppercasing string expressions, then we'd have stolen the best modifier chars away from the Unicode usecase. Even if we want to choose a path of having only a few things that can be interpolated, we should avoid diluting the set of memorable available chars (like \x) to fulfill minor/alternate purposes.
For minor variations beyond what is reasonable to use with the uppercase main modifier (\x vs \X), we introduce sub-modifiers (or a sub-format syntax of some kind) rather than top-level modifiers. For example, if we allow for customizable frac formatting, that could be \f for the default, with \fe or \f:e or whatever for scientific notation ("e" for exponent), rather than something like \e or \g. I'm not proposing that we introduce float formatting here, but am just proposing a general design principle.

Eli Dowling (Dec 27 2023 at 02:14):

I think \u(some expression) working is the most important factor. If you use the interpolation syntax you can't go changing the rules on folks. Yeah it only accepts something with type "int", just like the normal version only accepts a string, but it has to accept full expressions not just literals.
If you only accept literals, then you should change the syntax to \u{} or \u[] or some other thing to avoid confusion.

Isaac Van Doren (Dec 27 2023 at 05:37):

I agree. I would assume that any expression would be allowed based on the existing interpolation

Last updated: Jul 23 2026 at 13:15 UTC