currently, the syntax for inserting Unicode code points into string literals is \u(1234) where 1234 is a hexadecimal number representing the code point. For example:
"caf\u(e9)" == "café"
I recently realized there are two things I don't love about this:
- There's no 0x marker (or similar) like there would be in integer literals. In the case of \u(e9) you have a clue because of the e, but in the case of something like \u(123) it wouldn't be clear at all. I like the idea of making it more clear from looking at it that it's hexadecimal versus something else.
- \u(...) syntax is not the easiest to remember. Some languages use \u{e9}, but that seems wrong for Roc. { never goes around integer literals anywhere else in the language, so the only reason to choose that would be that other languages do it.
putting these together, I kind of like the idea of just using the normal string interpolation syntax and having it accept number literals, e.g.
"caf\(0xe9)" == "café"
"caf\(233)" == "café"
this would be a bit weird in that, today, what goes inside the \(...) needs to have the type Str - and this would make it so that it could be either an expression of type Str or else a number literal
however, what I like about it is:
\u(...)
I don't like it because I think there will be many annoying moments where, instead of getting a quick compiler error, a number you want to print will not be converted to a string.
It will be interpreted as Unicode.
I think it will just be a really common minor annoyance
\(myNum)...compile....wait for execution to reach the statement...see some random Unicode character....be confused for a second and minorly annoyed...update code to be \(myNum |> Inspect.toStr)
oh I was specifically thinking of number literals for this
\(myNum) would expect that myNum has the type Str because it isn't a number literal
in general we can't safely interpolate integers into the middle of strings because they aren't necessarily Unicode scalar values
with number literals we can validate at compile time that they're valid
Oh...yeah....should be fine. Though maybe minorly inconsistent or strange.
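For reference, the constraint being discussed: a Unicode scalar value is any code point from 0 through 0x10FFFF except the surrogate range 0xD800 to 0xDFFF. Here is a minimal Python sketch of the kind of check a compiler could run on a number literal. The function name is hypothetical and is not part of Roc.

```python
def is_unicode_scalar_value(n: int) -> bool:
    """Return True if n is a Unicode scalar value.

    Scalar values are the code points 0..0x10FFFF, excluding the
    surrogate range 0xD800..0xDFFF.
    """
    if 0xD800 <= n <= 0xDFFF:
        return False
    return 0 <= n <= 0x10FFFF


print(is_unicode_scalar_value(0xE9))      # True
print(is_unicode_scalar_value(0xD800))    # False (surrogate)
print(is_unicode_scalar_value(0x110000))  # False (past the last code point)
```

With a literal, this check can run once at compile time; with an arbitrary integer variable, it can only run when the program does.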
another option would be to keep the \u but have it be interpreted as a normal number literal, e.g. you'd need to do \u(0xe9)
Could also go \(u0xe9) which might feel a bit more like normal interpolation, but u0x is definitely noisy looking
but u0xe9 is a valid variable name :sweat_smile:
Ah... Yeah, nvm
What about something like \(U+12F0A1)? That aligns with the way codepoints are shown in the technical docs for unicode.
Why not \u1234? That seems most consistent, considering other escape sequences like \n work how you'd expect.
I don't like it sharing the string interpolation syntax even if it's only literals. That's the kind of thing I would do when debugging. Why isn't this variable printing out? Ok I'll replace it with a literal just to make sure I know what's happening. Huh? Now it just turned into some weird character, what? Oh yeah... you can't just replace things with their values in Roc any more because of the Unicode syntax!
Much better to keep a different escape sequence
So \u123 is fine and I'm used to it from other places like JSON, but misses the potential improvement of making it clear that it's hex.
Maybe \x123
It's clearly an escape sequence, different from any other. The x tells you it's hex, and then what else would it be other than Unicode
I guess the challenge is how to know when the number ends and the rest of the string begins. With braces that's easy to parse
Fair enough, as long as it's distinct from string interpolation!
Brian Carroll said:
Why isn't this variable printing out? Ok I'll replace it with a literal just to make sure I know what's happening. Huh? Now it just turned into some weird character, what? Oh yeah... you can't just replace things with their values in Roc any more because of the Unicode syntax!
I think I should have said it explicitly, but this syntax is unambiguous and not in use today:
» "thing: \(123)"
── TYPE MISMATCH ───────────────────────────────────────────────────────────────
This argument to this string interpolation has an unexpected type:
4│  "thing: \(123)"
              ^^^
The argument is a number of type:
    Num *
But this string interpolation needs its argument to be:
    Str
so the idea is to make this no longer a type mismatch, but rather to interpret it differently
such that \( followed by an integer literal followed by ) is treated differently by the parser
but anyway, the fact that this is surprising is probably predictive of how surprising it would be if this were the actual design :big_smile:
Could there be a function with a short name, that converts a number to its Unicode character as a string? Then you could write
"caf\(Str.u 0xe9)"
And there would not be the need for any special syntax in the language.
unfortunately this isn't possible because not all integers are valid unicode code points :big_smile:
I would expect more readers to interpret "\(97)" as equivalent to "97" rather than "a".
The reason I believe this is:
As such, I believe it would be a mistake to have the "\(...)" syntax represent codepoints when provided as integer literals. It'd be a frequent surprise to learners, it'd be seen as a language idiosyncrasy, and I think we'd regularly field questions about this.
Some things I don't like about it in particular:
. fractional delimiter, or optionally scientific notation for large values. Alternate representations can explicitly convert to strings first (i.e. for locale formatting).
If we want to prohibit non-strings because we want to make the developer think carefully about formatting, there are various mechanisms we can use to encourage that (such as some opaque "Display" type wrapping a string), but we'll invariably get requests for some convenient way to represent arbitrary data for dumping into logs, or which otherwise would not be presented to end-users. A less restrictive "\?(123)" escape (i.e. to work with the Inspect module) could serve that purpose.
For Unicode code points, I think "\u(123)" will work well enough and avoid the issues above, provided we can also specify "\u(someIntVar)" as well.
(the variable approach could panic if it doesn't represent a valid codepoint, or fail a build if it's eligible for compile-time evaluation)
I love the idea of using \?(expr) to automatically pipe through Inspect.toStr
Kevin Gillette said:
For Unicode code points, I think "\u(123)" will work well enough and avoid the issues above, provided we can also specify "\u(someIntVar)" as well.
\u(someIntVar) can't work because not all integers are valid code points
what about the idea of keeping \u(123) but having 123 interpreted as decimal instead of hexadecimal, like most number literals are? and then if you want hex, you do \u(0x123)
(as a transitionary step, we could try requiring 0x for awhile before opening it up to allow decimal)
I like that the best of the proposals so far. Essentially, just clarify the literal base in Unicode definitions.
Richard Feldman said:
what about the idea of keeping \u(123) but having 123 interpreted as decimal instead of hexadecimal, like most number literals are? and then if you want hex, you do \u(0x123)
Yeah, I think that'll be an effective syntax, exactly as you describe.
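To illustrate the "decimal by default, 0x for hex" reading of what goes inside \u(...), here is a rough Python sketch. It is not how the Roc parser works; the function name is hypothetical, and it assumes only plain digits or a lowercase 0x prefix are accepted (no underscores or type suffixes).

```python
def parse_code_point_literal(text: str) -> int:
    # Interprets the text found between the parentheses of a
    # hypothetical \u(...) escape as an ordinary number literal:
    # plain digits are decimal, a 0x prefix selects hexadecimal.
    if text.startswith("0x"):
        digits, base = text[2:], 16
        allowed = "0123456789abcdefABCDEF"
    else:
        digits, base = text, 10
        allowed = "0123456789"
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"not a decimal or 0x-prefixed literal: {text!r}")
    return int(digits, base)


# Both spellings from earlier in the thread name the same code point.
assert "caf" + chr(parse_code_point_literal("0xe9")) == "café"
assert "caf" + chr(parse_code_point_literal("233")) == "café"

# Under this reading, 123 and 0x123 are different code points.
print(parse_code_point_literal("123"), parse_code_point_literal("0x123"))  # 123 291
```

Under this sketch, the old spelling \u(e9) would be rejected rather than silently reinterpreted, which is roughly what the "require 0x for a while" transition step would rely on.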
It'd be great to have some mechanism to interpolate a variable into a codepoint, even if that just ends up panicking or getting replaced by the Unicode replacement char when invalid (as if they had specified \u(0xfffd)).
We have panicking, checked, and other variants of some other operations (such as arithmetic), so it seems like there's a plausible path to doing the same here as well. Are strings special enough to require a wholly different philosophy to runtime validation than we apply to numerics?
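As a sketch of what "panicking, checked, and other variants" might look like for turning a runtime integer into a one-character string, here is a Python illustration. All function names are hypothetical and nothing here is an existing Roc API; Python's exceptions and None stand in for a panic and an error result.

```python
from typing import Optional

REPLACEMENT = chr(0xFFFD)  # U+FFFD REPLACEMENT CHARACTER


def _is_scalar(n: int) -> bool:
    # Code points 0..0x10FFFF, minus the surrogates 0xD800..0xDFFF.
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)


def code_point_checked(n: int) -> Optional[str]:
    """Checked variant: None signals an invalid code point."""
    return chr(n) if _is_scalar(n) else None


def code_point_or_replacement(n: int) -> str:
    """Lossy variant: invalid input becomes U+FFFD."""
    return chr(n) if _is_scalar(n) else REPLACEMENT


def code_point_or_panic(n: int) -> str:
    """Panicking variant: invalid input raises an error."""
    if not _is_scalar(n):
        raise ValueError(f"{n:#x} is not a Unicode scalar value")
    return chr(n)


print(code_point_checked(0xD800))                        # None
print(code_point_or_replacement(0xD800) == REPLACEMENT)  # True
```

Which of these an escape like \u(someIntVar) would map to is exactly the open design question; the sketch only shows that the three behaviors are easy to tell apart.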
In any case, it sounds like a Roc programmer just needs to learn that everything that doesn't render as-is in a string starts with \, and all interpolations have something in parentheses. The interpretation/behavior of the interpolation depends on what immediately follows the \
I don't like the parentheses because that syntax strongly suggests that it should work the same as interpolation, and therefore an arbitrary expression should work. But that's misleading because we need to check the value.
\u123 and \uxab12 could work?
Or \u{123}
And \u{0x123}
the problem with \u123 is that there can be letters and/or numbers right after it in the string literal, and it's no longer possible to tell where the escape ends
other languages have done that and they all end up adding some delimited syntax because of that situation, so I think we should just have the (less error-prone) delimited syntax
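A small Python sketch of that ambiguity, assuming a greedy "read hex digits until they stop" rule for the undelimited form. The regexes and the `expand` helper are made up for illustration and do not describe any real parser.

```python
import re

# Undelimited form: \u followed by as many hex digits as happen to match.
UNDELIMITED = re.compile(r"\\u([0-9a-fA-F]+)")
# Delimited form: the hex digits are enclosed in parentheses.
DELIMITED = re.compile(r"\\u\(([0-9a-fA-F]+)\)")


def expand(pattern: re.Pattern, source: str) -> str:
    """Replace each escape matched by pattern with its code point."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), source)


# Intended text in both cases: "café" immediately followed by "bad".
assert expand(DELIMITED, r"caf\u(e9)bad") == "cafébad"

# Without a delimiter, "bad" is swallowed because b, a, d are hex digits.
ambiguous = expand(UNDELIMITED, r"caf\ue9bad")
print(len(ambiguous), hex(ord(ambiguous[-1])))  # 4 0xe9bad
```

A fixed digit count (as in JSON's four-digit \uXXXX) avoids the greedy problem but cannot express code points above 0xFFFF in one escape, which is one reason languages end up adding a delimited form.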
a thing I don't like about \u{123} is that {123} is never valid Roc syntax, so it feels kind of arbitrary and unnecessarily hard to remember :sweat_smile:
although I appreciate the point about \u(123) looking like interpolation while not being interpolation
what about \u[123]? It looks like a list, and in fact it could even allow multiples
e.g. \u[123, 456]
although then maybe it once again looks like it should be interpolation haha
a thing I don't like about \u{123} is that {123} is never valid Roc syntax
Well we are talking about creating new syntax!
Sounds like you want to make it look like some other existing syntax whereas I was specifically trying to go in the opposite direction. Because again all other syntax allows literals and other expressions interchangeably.
yeah that's an interesting point :thinking:
a consideration about \u{123} specifically is that other languages use that exact syntax, but they interpret 123 as hexadecimal
How about \ud{123} for decimal and \ux{123} for hex with no option to leave the base unspecified?
Bit more verbose, I don't know if I like it or not, just occurred to me
or I guess could shorten that to \x{123} and \d{123}
ha ha I was just about to type the same thing
Personally, I much prefer the \u(..) syntax specifically because it looks like interpolations
I think keeping those similar is really nice
Then it also makes adding any syntax like \?(...) just fit (for automatically applying Inspect.toStr), which I think is another win
I am not a fan of \x or \d. I don't think users will as easily remember what they mean when they randomly see them in a codebase
let's do a separate thread about the \? idea to keep this thread focused on unicode :big_smile:
so here we have competing goals - obviously we can't pick a syntax that both looks like other valid Roc syntax while also not looking like it!
I'm curious what others think about that basic question - of whether looking like existing syntax is positive, negative, neutral, etc
I think the idea of \?(...), even if we never add it to the language, is important for this discussion. If we add \u{...} or \x{...}, I don't think that \?(...) would fit in. On the other hand, if we do \u(...), I think it would fit much more nicely. So I kinda see having one as making the other more justifiable in the language.
It would also suck to pick \u{...} here, then in the other thread realize we don't want \?{...} cause it is too different from other interpolation syntax, but we also don't want \?(...) cause it is not like any other syntax. I don't think we need to discuss \?(...) here. I just think it (and the general idea of possibly more extensions with the same syntax) should be kept in mind as we pick this syntax.
hm maybe, but I'd still like to discuss the \? idea separately and reference it from this discussion, instead of combining both into one thread
I'm okay with \u(...), the u indicates something different is going on, that's good enough for me.
I think \u(...) is the best option discussed so far because we've already set the tone with \(...).
The easiest thing for people to remember will be "\ and () always define an interpolation of some kind, and an optional 'modifier' character after the \ determines how the interpolation is interpreted."
For consistency, "the contents of the parentheses is always a valid Roc expression, and it's the modifier char alone that is used to determine what type checks, what expressions are allowed (the compiler will guide the programmer), and how that expression is formatted."
So far as relevant to this thread (not being discussed elsewhere):
- \(...) is a display interpolation, which presently must type check to a string, but otherwise has no restrictions on expression.
- \u(...) must type check to an int, and presently expressions are limited to positive untyped int literals (27i128 probably has no value being expressed). The integer is interpreted as a Unicode code point.
Some of these type and expression restrictions may be relaxed in the future, though since these are just Roc expressions, and since each formatting interpretation has a distinct modifier char, such an expansion would automatically be backwards-compatible.
If we ever wanted a distinct behavior/interpretation, we should introduce a new modifier char or some multi-char modifier syntax. However, we should be frugal with our choices.
For example, \x(123) was briefly proposed as "interpret 123 as a hexadecimal Unicode code point." We should always ask ourselves whether a modifier char would better represent something else. Whether or not we want to provide this capability, the answer there is easy: hex formatting strings, List U8, and integers.
I propose this:
- \u, and so no other modifier char (not \x, not \d, nor anything else) may be used for variations on that same purpose. Conversely, \U is reserved for use as an alternate major mode for expressing Unicode code points.
- \U or other uppercase modifiers, but if we do, we should have a good idea about what uppercase thematically means for modifiers in general before we introduce the first one, so that programmers can easily learn "the uppercase form of any modifier does this kind of thing," so that they can memorize the lowercase modifiers and a thematic rule, rather than needing to separately memorize both lowercase and uppercase modifiers. Rationale: imagine a keyboard where shift+d yielded "f".
- \u for unsigned integers and perhaps \U for uppercasing string expressions, then we'd have stolen the best modifier chars away from the Unicode usecase. Even if we want to choose a path of having only a few things that can be interpolated, we should avoid diluting the set of memorable available chars (like \x) to fulfill minor/alternate purposes.
- \x vs \X), we introduce sub-modifiers (or a sub-format syntax of some kind) rather than top-level modifiers. For example, if we allow for customizable frac formatting, that could be \f for the default, with \fe or \f:e or whatever for scientific notation ("e" for exponent), rather than something like \e or \g. I'm not proposing that we introduce float formatting here, but am just proposing a general design principle.
I think \u(some expression) working is the most important factor. If you use the interpolation syntax you can't go changing the rules on folks. Yeah it only accepts something with type "int", just like the normal version only accepts a string, but it has to accept full expressions not just literals.
If you only accept literals, then you should change the syntax to \u{} or \u[] or some other thing to avoid confusion.
I agree. I would assume that any expression would be allowed based on the existing interpolation
Last updated: Jun 16 2026 at 16:19 UTC