What is the most ergonomic way to format several objects of different types into a string for printing? Something like the print! and format! macros in Rust, Python's str.format and f-strings, C++'s fmt library or std::format...
In Roc I end up writing a lot of code just to print a bunch of formatted integers:
diskSizeStr = Num.toStr diskSize
totalUsedSizeStr = Num.toStr totalUsedSize
freeSpaceStr = Num.toStr freeSpace
spaceToFreeStr = Num.toStr spaceToFree
sizeOfDirToDeleteStr = Num.toStr sizeOfDirToDelete
chain
[ Stdout.line "Disk size: \(diskSizeStr)"
, Stdout.line "Total used size: \(totalUsedSizeStr)"
, Stdout.line "Free space: \(freeSpaceStr)"
, Stdout.line "Space to free: \(spaceToFreeStr)"
, Stdout.line "Size of dir to delete: \(sizeOfDirToDeleteStr)"
]
My main frustration is that string interpolation only takes arguments of type string and does not accept inline function calls, which means I have to create a constant for every value I want to print. Is there a more ergonomic way of converting to string and printing output?
I think the answer is currently no, sadly.
Long term we want to add some sort of debug and display abilities. They would hopefully be auto derived.
Currently the only way to get a print out without conversion is to use dbg, but it is very new and has a number of issues.
In the meantime, you could perhaps use a list of message and number records, which you map into the Stdout.line tasks.
@Brendan Hansknecht I was going to open a thread about this as well. In the short and medium term, would it be alright if we special cased interpolation to accept numbers as well, presuming abilities would produce the same output (i.e. special cased implicit conversion using Num.toStr, which a later ability would presumably be implemented in terms of)?
If that sounds reasonable, I would be open to working on this.
I think that would be pretty awesome. Though it may be best to just implement display at this point. I think we have all of the prerequisite pieces at this point.
Aside: we have encode that can encode things as json for example, so theoretically that could also be used to convert a large record into something printable. Doesn't move everything to one line, but would enable merging many variables into one printable thing.
Doesn't move everything to one line
@Brendan Hansknecht can you clarify? Does encode currently produce multi-line output? [by default?]
I would imagine that a display ability would produce concise Roc literal syntax, i.e. some balance of value-unambiguous (though probably ignoring the difference between U8 and I128 by outputting 5 instead of 5u8), while also being familiar and readable to Roc developers? This seems to be a pretty modern, increasingly typical tradeoff, where if you care about the specific formatting, you opt into a specific formatting in order to control the output.
The trick comes with strings like "5" (which display the same as 5), versus {a: 1, b: "hello"}, which presumably display as either literal syntax or {a: 1, b: hello}. I'm expecting that a record, list, or any other data structure (tags?) composed solely of display-able types could itself automatically derive the display ability?
For cases where _guaranteed_ non-ambiguity is needed, we may want another ability. Consider {a: "1", b: "hello"}, which could perhaps be displayed as {a: "1", b: hello}, that is, rendering a as a quoted string because its contents could be misinterpreted as belonging to an entirely different type family (whereas "hello" can't be misinterpreted for anything other than a string in Roc).
Oh, I meant when you write code it still would be 2 lines if you want extra context. You can't use it directly in string interpolation.
varStr = Encode.encode var
Stdout.line "My Thing: \(varStr)"
With display, we will theoretically make string interpolation use it by default, so you would be able to just write
Stdout.line "My Thing: \(var)"
Hmm. That's a really good point on strings. I guess this is why we may need both a Display ability and Debug. One for pretty printing and another for debug. Of course by making an opaque type, you can always override either.
I assume @Richard Feldman or someone else has maybe put a bit more thought into the api we want to support.
but otherwise, I guess we just need to discuss the semantics of the api some.
yeah so I think there are two different use cases:
I think every Roc type should automatically infer some implementation of the second thing - we'd previously discussed naming the ability Inspect - and I think dbg and expect should use that implementation to stringify values, so you can customize how they look in opaque types
e.g. if you made a custom data structure with its own custom Eq, and then you write a test that compares two of them, if that test fails it's not going to be super helpful to get a dump of the entire internal structure of the data structure on the screen - compared to (for example) having it shown as MyTree.fromList [1, 2, 3] or something like that (which the author of the data structure could achieve by implementing a custom Inspect for that opaque type, which would replace the default Inspect implementation that just returns "<opaque>" so as not to leak internal implementation details by default)
stringifying things for people using the software is a different use case; let's say we call it Display (I don't love that name because I think we'll want an ability in the future for rendering types in the editor - basically Inspect except instead of returning a Str it returns UI elements, and I think it would be strange to have an ability called Display and another called Render, and the first of those returns a Str...but still, I don't know of a name I like better, so let's call it Display for now since that's what Rust calls it)
I think Display should only be auto-inferred for strings and numbers, because I actively want a type mismatch if I accidentally try to render something for an end user that's really an internal-only type
for example, I don't think records should get Display because if I'm rendering a record directly to an end user, I'm presumably making a mistake - why would I want to display Roc code to my user? If it's because they're a programmer, fair enough - but in that unusual case I can use Inspect instead
we actually had a bug at work once because Elm had a toString function that accepted any value, and we accidentally passed it a function and it ended up silently doing the wrong thing
so I think Display should be automatically implemented for strings and numbers, and that's it by default - but if you want to implement it for your own custom opaque type, you can
for example, in the case of Path I want to do that; rendering paths to users is very normal, but under the hood paths are allowed to contain unprintable Unicode characters (because UNIX allows those characters in its filenames), so when displaying it to an end user, the idea would be to convert any invalid characters to the Unicode replacement character the way Rust does
I don't think we should try to hack in a "string interpolation accepts either strings or numbers for now" feature, because there's actually a deep type inference rabbit hole there; e.g. if someone puts this into the repl, you have to decide what it prints for the type:
\x -> "could be a string or number: \(x)"
so I'd rather we just directly spent the effort on making Display and making string interpolation accept something that has Display instead of a Str value
All of that sounds quite reasonable to me, except, perhaps, regarding Path:
for example, in the case of Path I want to do that; rendering paths to users is very normal, but under the hood paths are allowed to contain unprintable Unicode characters (because UNIX allows those characters in its filenames), so when displaying it to an end user, the idea would be to convert any invalid characters to the Unicode replacement character the way Rust does
Wouldn't the right user experience be to convey to the user something that's unambiguous _and_ usable for all valid paths?
Except for systems which interpret paths as Unicode, we shouldn't do so either, unless we're quite confident that applications or OS handling will transparently re-encode UTF-8 into latin-1 or whatever via collation rules. Even then we should only do so when there's a reversible transform available.
U+FFFD is not a reversible transform. We could encode such non-unicode characters with \xFF style escapes (or an equivalent): the user might not be able to copy-paste the result, but it is unambiguous, it hopefully won't come up often, and at least sufficiently technical users would be able to transform it into whatever they need.
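To make the reversibility point concrete, here is a minimal sketch in Go (not Roc), using only the standard library. The helper names lossy, escaped and roundTrips are hypothetical, and the two file names are made-up Latin-1 examples. It compares replacing invalid bytes with U+FFFD against rendering them as \xNN escapes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lossy replaces each run of invalid UTF-8 bytes with U+FFFD.
func lossy(path string) string {
	return strings.ToValidUTF8(path, "\uFFFD")
}

// escaped renders path as a Go string literal; invalid bytes become \xNN escapes.
func escaped(path string) string {
	return strconv.Quote(path)
}

// roundTrips reports whether the escaped form decodes back to the original bytes.
func roundTrips(path string) bool {
	back, err := strconv.Unquote(escaped(path))
	return err == nil && back == path
}

func main() {
	a := "caf\xe9.txt" // Latin-1 bytes; 0xE9 is not valid UTF-8 in this position
	b := "caf\xe8.txt" // a different file name on disk
	fmt.Println(lossy(a) == lossy(b)) // the two names become indistinguishable
	fmt.Println(escaped(a), escaped(b))
	fmt.Println(roundTrips(a), roundTrips(b))
}
```

The replacement-character rendering maps both names to the same text, while the escaped rendering keeps them distinct and can be decoded back to the original bytes.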
Aside: Go has pretty clever (and frankly, simple) strings: a string is just an immutable array of bytes. Most stdlib APIs will treat it as a UTF-8 encoded byte sequence (and thus inject replacement characters and such), and default iteration over a string will yield a byte offset paired with a rune (codepoint).
However, you can treat any string as binary merely by walking over it using a byte-by-byte iteration (for pure ASCII strings this essentially has the same effect as regular iteration). You can mix and match \u and \U escapes alongside \x (hex) and \NNN (octal) escapes, and your hex escapes can of course result in either valid or non-valid UTF-8 sequences. Literal, unescaped characters are treated as UTF-8 and the compiler rejects any regular string literals that produce invalid UTF-8 except via escape.
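A small Go sketch of the two iteration styles described above. The sample string and the helper name runeOffsets are made up for illustration. The range loop decodes UTF-8 and reports the byte offset of each rune, while the index loop sees raw bytes.

```go
package main

import "fmt"

// runeOffsets returns the byte offset at which each decoded rune starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// 'h', 'é' (two bytes), one stray invalid byte written with a hex escape, '!'
	s := "h\u00e9\xff!"

	// Unicode view: range yields (byte offset, rune); the invalid byte shows up as U+FFFD.
	for i, r := range s {
		fmt.Printf("offset %d: %U\n", i, r)
	}

	// Binary view: indexing yields the raw bytes, with no decoding at all.
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %#x\n", i, s[i])
	}

	fmt.Println(runeOffsets(s), len(s))
}
```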
In other words, you have full control, yet it's designed to make people fall into the pit of success of treating things as Unicode by default. I know people who've used Go for years but have never had to consume binary-containing strings who didn't know Go strings could do that yet have never been burned by the capability (or had the opportunity to notice) either. I have experienced or fathomed zero practical downsides to Go's approach of not getting in your way yet defaulting to Unicode.
In contrast, I've been using Python frequently for a decade and a half, and both 2.x and 3.x have separate binary and textual string types, and there's nothing at all that's straightforward or enjoyable about needing to deal with a mix of binary and Unicode in that language (or converting between them). When it comes to binary, Python is a draconian enforcer that will chase you forever and haunt your dreams, even in cases where there are entirely sensible defaults, and 3.x is often considered to have more annoying behavior than 2.x.
As such, I retain an open mind of course, but I've generally found languages that make a formal distinction between Unicode and binary strings to have been overly preoccupied with hypothetical risks (or a desire for precise and rigid type taxonomies) at the expense of practicality, ease of use, and trust in the competency of their programmers.
Wouldn't the right user experience be to convey to the user something that's unambiguous _and_ usable for all valid paths?
I think it's important to think about what the actual use cases are for displaying a path that might contain invalid characters.
A friend of mine mentioned that he used to have a large collection of old Japanese audio files where the filenames were encoded in an encoding that predated Unicode, and since they contained invalid Unicode, all sorts of programs would just blow up when trying to interact with them.
to me, the best user experience here is:
I don't think it's better to render escapes, because that would degrade the experience for valid filenames; they would need to be double-escaped. For example, valid UNIX paths can contain any byte except 0. So a UNIX path with \ is valid. If we introduce \ as an escape character, then to remain unambiguous, we have to replace any \s we encounter in paths with \\ to show that this was actually originally a \. But now we're taking a path that used to be valid and displaying it as a different path that's also valid from a Unicode perspective, but it's the incorrect path if they copy/paste it.
Given that all of these only come up in unusual edge cases, concern about this edge case seems reasonable, and escaping seems like a more error-prone user experience than showing replacement characters!
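The double-escaping concern can be seen in a short Go sketch. The function escapePath is hypothetical; it is one possible escaping scheme, not something either Roc or Go provides for paths. Invalid bytes become \xNN, so a literal backslash has to become \\ to stay unambiguous.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// escapePath is a hypothetical display function: invalid bytes are shown as
// \xNN, and literal backslashes are doubled so the output stays unambiguous.
func escapePath(path string) string {
	var b strings.Builder
	for i := 0; i < len(path); {
		r, size := utf8.DecodeRuneInString(path[i:])
		switch {
		case r == utf8.RuneError && size == 1:
			fmt.Fprintf(&b, `\x%02x`, path[i]) // invalid byte
		case r == '\\':
			b.WriteString(`\\`) // valid character that now needs escaping
		default:
			b.WriteString(path[i : i+size])
		}
		i += size
	}
	return b.String()
}

func main() {
	valid := `a\xff`   // five ordinary characters: a, backslash, x, f, f
	invalid := "a\xff" // two bytes: a, then the invalid byte 0xFF
	fmt.Println(escapePath(valid))   // shown with a doubled backslash
	fmt.Println(escapePath(invalid)) // shown with a single backslash
}
```

The fully valid name is the one that gets displayed differently from what is on disk, which is the copy/paste hazard described above.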
default iteration over a string will yield a byte offset paired with a rune (codepoint). [...] I have experienced or fathomed zero practical downsides to Go's approach of not getting in your way yet defaulting to Unicode.
what's your experience using this API with extended grapheme clusters?
I've wondered a bit about optimistic rendering, i.e. display a string with no special modifications nominally, but if any part of it is problematic, escalate to a descriptive rendering.
I've used this in the equivalent of a concise Debug formatting, specific to applications I've worked on, in particular allowing strings to be unquoted when they have safe characters but quoted otherwise.
Less drastically, Python's repr of a string will produce single-quoted output unless the input contains single quotes, in which case it'll produce double-quoted output (unless the input _also_ contains double quotes, in which case it'll produce single-quoted output with escaped internal single quotes).
Go can be asked to format a string into a raw string literal (with the %#q Printf directive), but it'll produce a double quoted string if the input contains anything that can't be represented in a raw string literal (i.e. a backquote).
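For reference, a minimal Go sketch of the %#q behavior just described. The wrapper name rawOrQuoted and the sample inputs are made up.

```go
package main

import "fmt"

// rawOrQuoted formats s as a raw (backquoted) literal when that is possible,
// and falls back to a double-quoted, escaped literal otherwise.
func rawOrQuoted(s string) string {
	return fmt.Sprintf("%#q", s)
}

func main() {
	fmt.Println(rawOrQuoted(`C:\temp`))   // representable as a raw literal
	fmt.Println(rawOrQuoted("tick`tock")) // contains a backquote, so it is double-quoted
}
```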
You had mentioned an aversion to variable syntax output, which I believe is the right choice at the language/stdlib level.
what's your experience using this API with extended grapheme clusters?
@Richard Feldman Just so we're clear, can you reply with a string you have in mind, perhaps also in an escaped Roc literal form (in case any software does normalization or transforms), as well as a brief description of operations you'd want to see/evaluate in practice?
sure: "👩‍👩‍👦‍👦"
» "👩‍👩‍👦‍👦" |> Str.toScalars
[128105, 8205, 128105, 8205, 128102, 8205, 128102] : List U32
» "👩‍👩‍👦‍👦" |> Str.graphemes
["👩‍", "👩‍", "👦‍", "👦"] : List Str
so this is:
ha, actually this is a bug! This should be 1 extended grapheme cluster
(I just double checked in Swift, which is what the design is based on)
I'll open an issue for that
anyway, this ended up being a good example of how easy this stuff is to get wrong in the presence of things like emojis which contain multiple code points
I think if an application author is using code points in any way, it is overwhelmingly likely that their code has bugs in the presence of emojis
code points are essentially never the right thing to reach for unless you are very specifically writing a low level library - like a Unicode tool or glyph renderer - and I think they should be as buried as possible in the standard library (and maybe not even exposed at all, honestly)
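A short Go sketch of why code points are a risky unit for application code, using the same emoji as above written with escapes. The helpers runeCount and firstRunes are hypothetical. Go's standard library has no grapheme-cluster segmentation, so the sketch only shows the byte and code point views.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// family is the emoji from the discussion: four people joined by U+200D (zero width joiner).
const family = "\U0001F469\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"

// runeCount returns the number of code points in s.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

// firstRunes keeps the first n code points of s, the kind of "truncate to n
// characters" helper that looks harmless in application code.
func firstRunes(s string, n int) string {
	r := []rune(s)
	if n > len(r) {
		n = len(r)
	}
	return string(r[:n])
}

func main() {
	fmt.Println(len(family))           // bytes
	fmt.Println(runeCount(family))     // code points
	fmt.Println(firstRunes(family, 1)) // cuts the cluster apart: a single person remains
}
```

A user sees one character, but the string is 25 bytes and 7 code points, and truncating by code points silently changes which emoji is shown.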
and I think slicing into arbitrary byte indices is locally better for performance within the context of that one operation, but globally worse because all sorts of other operations can no longer safely assume they're receiving valid UTF-8
which means they either need to defensively code around it or else give up memory safety (which is out of the question in Roc)
given that, I'm not sold on any part of Go's design - the "it's just a sequence of bytes with no guarantees" means UTF-8 operations on them must all verify defensively, slowing them down more than I would expect direct byte indexing to make up for, and the "expose code points" is just asking for emoji bugs because code points are almost never the right unit to look at
incidentally the way I found out about extended grapheme clusters was that we had a big fire at work due to users inputting emojis that we weren't processing correctly because the original authors of a particular code path didn't know about these distinctions, and Ruby didn't offer a "pit of success" (which I think Swift does, and which I'm trying to make Roc's API do too!)
https://go.dev/play/p/_WX_tMCTFmD
Indeed I see what you mean about iterating codepoints/scalars, and I agree with that. I was intending to convey my preference for the unicode/binary duality within a single type (using whatever the best interpretation of each aspect), not specifically the use of code-points.
Unicode example where the graphemes happen to also be single codepoints: https://go.dev/play/p/otKTPn27DZy
Go's apparent extended-cluster non-awareness doesn't have an effect here.
Binary data example (a 1x1 PNG): https://go.dev/play/p/MAdhG0jolZg
Richard Feldman said:
code points are essentially never the right thing to reach for unless you are very specifically writing a low level library - like a Unicode tool or glyph renderer - and I think they should be as buried as possible in the standard library (and maybe not even exposed at all, honestly)
To make those third party libraries capable of interoperating with regular Roc strings, I'd imagine you'd still at least need to expose the Str.toUtf8 function? Otherwise, there's little opportunity for such a library to do low level processing.
yeah totally!
I think that's necessary for several reasons, e.g. to write strings to disk or to the network
Richard Feldman said:
which means they either need to defensively code around it or else give up memory safety (which is out of the question in Roc)
Please elaborate on how this would be a memory safety concern? Roc strings are sized, and granting direct byte access (let's say with a hypothetical Str.getByte that behaves equivalently to List.get) would still need to obey that size. Sure, an incomplete low-level decoder might have unforeseen behavior if encountering a partial UTF-8 codepoint byte sequence (or an incomplete grapheme cluster sequence, for that matter), but from a language perspective, that behavior is defined: it can't peek into any memory that doesn't belong to the passed string.
some UTF-8 bytes indicate that "I am a multi-byte code point, so to determine the actual code point, you'll need to read the next byte"
if you end a "string" with one of those, then in the absence of defensive checking, something that's iterating and encounters one will read past the end of the string's memory and potentially segfault etc
defensive checking would mean verifying on every single multibyte code point that you're not about to overshoot the end of the string
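The check being described can be sketched as a hand-rolled decoder in Go. This is an illustration, not how Roc or Go implement decoding; the names seqLen and countCodePoints are hypothetical, and the sketch deliberately ignores overlong encodings and surrogates. The length check before each multi-byte read is the defensive step in question.

```go
package main

import "fmt"

// seqLen returns how many bytes a UTF-8 sequence should occupy according to
// its lead byte, or 0 if the byte cannot start a sequence.
func seqLen(lead byte) int {
	switch {
	case lead < 0x80:
		return 1
	case lead&0xE0 == 0xC0:
		return 2
	case lead&0xF0 == 0xE0:
		return 3
	case lead&0xF8 == 0xF0:
		return 4
	}
	return 0
}

// countCodePoints walks b and verifies, before every multi-byte read, that
// the sequence announced by the lead byte actually fits inside the input.
func countCodePoints(b []byte) (int, error) {
	n := 0
	for i := 0; i < len(b); {
		size := seqLen(b[i])
		if size == 0 {
			return n, fmt.Errorf("invalid lead byte at offset %d", i)
		}
		if i+size > len(b) { // the defensive check: would we overshoot the end?
			return n, fmt.Errorf("truncated sequence at offset %d", i)
		}
		for j := 1; j < size; j++ {
			if b[i+j]&0xC0 != 0x80 {
				return n, fmt.Errorf("bad continuation byte at offset %d", i+j)
			}
		}
		i += size
		n++
	}
	return n, nil
}

func main() {
	fmt.Println(countCodePoints([]byte("h\u00e9")))         // well-formed input
	fmt.Println(countCodePoints([]byte{'h', 0xE2, 0x82})) // ends in the middle of a 3-byte sequence
}
```

If the input is guaranteed to be valid UTF-8 up front, the length check inside the loop can be dropped, which is the trade-off under discussion.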
Richard Feldman said:
given that, I'm not sold on any part of Go's design - the "it's just a sequence of bytes with no guarantees" means UTF-8 operations on them must all verify defensively, slowing them down more than I would expect direct byte indexing to make up for, and the "expose code points" is just asking for emoji bugs because code points are almost never the right unit to look at
You only need to be as defensive as the situation demands. If you want to process a Go string as bytes, it probably means that either:
The pit of success for this in Go is that the iteration style almost exclusively used in the language (high-level range iteration) treats a string as Unicode. This iteration style is more concise, more convenient, and learned about first. I believe some review of open source Go code found that 97% of all loops in Go use range form. It's remarkably rare/unlikely that people fall into an "oops, I treated this textual string as bytes by accident" trap (actually, I believe that more people know about and are comfortable converting a string into an explicit array of bytes than there are people who know about or remember off-hand the lower-level iteration form in Go). Most other parts of the language, likewise, also treat a string as Unicode.
If we had a time machine, and could make Go better handle extended grapheme clusters (i.e. iterate clusters instead of code points), then in theory it'd have been better, though since Go is garbage collected and grapheme clusters would seem to be variable length, that may have still been a tenuous choice for the language compared to the performance tradeoffs that Roc makes (such as favoring refcounting most of the time and thus handling throwaway arrays more cheaply).
Some things I'd like to ideally be able to do without jumping through _too_ many hoops:
These are certainly doable in Roc (and Roc would certainly do a better job of it than Python), but I'm guessing number 3 would probably take some work, i.e. maintaining parallel copies of the original data as List U8 (binary) and List Str (extended grapheme clusters), then walking the list of clusters until U+FFFD is encountered, doing some countUtf8Bytes arithmetic at each step along the way. When the replacement character is encountered, you'd then need to do some heuristics/detection on the List U8 to match it back up to subsequent clean cluster bytes to isolate the problem byte(s), while also needing to have a fair bit of UTF-8 encoding and Unicode knowledge.
I could also see the above being done with Str.fromUtf8Range, though that can also waste a fair bit of compute by optimistically setting a 4-byte stride or pessimistically setting a 1-byte stride, though it certainly has the advantage of not needing a duplicate of what could be very large data.
In Go, you get #3 essentially for free: iteration gives you byte offsets as well as the decoded semantic data, so when you see a U+FFFD, you mark the start offset and keep iterating until you see something other than U+FFFD, mark the trailing offset, and then you have your broken byte range, all without the programmer really needing more than a cursory understanding of UTF-8, and all without needing to make guesses or heuristics (and without any need for a copy of the data).
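A minimal Go sketch of that offset-marking approach. The function name brokenRanges is hypothetical. It calls utf8.DecodeRuneInString directly, which performs the same decoding as a range loop but also reports the width, so a stray byte (width 1) can be told apart from a genuine U+FFFD already present in the text (width 3).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// brokenRanges returns the [start, end) byte ranges of s that are not valid UTF-8.
func brokenRanges(s string) [][2]int {
	var ranges [][2]int
	start := -1 // start offset of the current broken run, or -1 if none is open
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			if start < 0 {
				start = i // mark where the broken run begins
			}
		} else if start >= 0 {
			ranges = append(ranges, [2]int{start, i}) // the run ended just before i
			start = -1
		}
		i += size
	}
	if start >= 0 {
		ranges = append(ranges, [2]int{start, len(s)})
	}
	return ranges
}

func main() {
	fmt.Println(brokenRanges("ab\xff\xfecd\xc3")) // two broken runs
	fmt.Println(brokenRanges("a\uFFFDb"))         // a real U+FFFD is not reported
}
```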
Even without a binary-in-strings capability, something like the following would effectively give Roc that forensic power described above...
decodeFirstGraphemeFromUtf8Bytes : List U8 -> Result { grapheme : Str, numBytes : Nat } *
It would just be applied repeatedly, while trimming bytes off the front of the provided list in order to process more input.
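The "apply repeatedly and trim" loop might look like the following Go sketch. It is an analogue only: decodeFirst and decodeAll are hypothetical names, and because Go's standard library has no grapheme segmentation, the sketch decodes one code point at a time rather than one extended grapheme cluster.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// decodeFirst decodes the first code point of b and reports how many bytes it
// consumed. On invalid input it reports one consumed byte and an error.
func decodeFirst(b []byte) (text string, numBytes int, err error) {
	if len(b) == 0 {
		return "", 0, errors.New("empty input")
	}
	r, size := utf8.DecodeRune(b)
	if r == utf8.RuneError && size == 1 {
		return "", 1, errors.New("invalid UTF-8")
	}
	return string(r), size, nil
}

// decodeAll applies decodeFirst repeatedly, trimming consumed bytes off the
// front, and records the offsets at which decoding failed.
func decodeAll(b []byte) (texts []string, badOffsets []int) {
	offset := 0
	for len(b) > 0 {
		text, n, err := decodeFirst(b)
		if err != nil {
			badOffsets = append(badOffsets, offset)
		} else {
			texts = append(texts, text)
		}
		b = b[n:] // reslicing does not copy the underlying bytes
		offset += n
	}
	return texts, badOffsets
}

func main() {
	texts, bad := decodeAll([]byte("a\xff\u00e9"))
	fmt.Println(texts, bad)
}
```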
yeah, and with seamless slices, we could go even further and give you back a slice of the original List U8 instead of a Nat - so you wouldn't have to pay for a bounds check to index into it again!
Richard Feldman said:
if you end a "string" with one of those, then in the absence of defensive checking, something that's iterating and encounters one will read past the end of the string's memory and potentially segfault etc
Yep, I'm familiar with that. My confusion is just how that could happen in Roc, specifically. I wouldn't imagine that Roc would be deliberately made memory-unsafe in any scenario, so it could at most panic if attempting to read/decode past the end of a string, where I'm interpreting "string" as a known/enforced-length list of bytes that is expected to contain well-formed UTF-8. A [non-streaming] decoder that is not aware of the length of its input is arguably not sound or sufficiently robust, whereas a sound decoder should be able to stop, with a semantic error, upon reaching the end of the known-size byte array without attempting to blow past that end merely because it's in the middle of a byte sequence.
Sure, we can rely on memory-safe semantics to save us from a buffer overflow, but that still doesn't mean that such a decoder would be properly designed.
I don't see the choice of enforced utf8 vs optimistically utf8 as forcing an existential dilemma for the language. And for that matter, I'm guessing Roc does not scan every byte of every string a platform gives it at the moment it's received just to pre-screen for this potential issue, so already today (and foreseeably in the future, for performance reasons), I'd imagine I could trivially write a platform that gives Roc a 10 GiB binary "string" (I'd merely expect Roc to interpret a lot of replacement chars out of that data).
Richard Feldman said:
yeah, and with seamless slices, we could go even further and give you back a slice of the original List U8 instead of a Nat - so you wouldn't have to pay for a bounds check to index into it again!
Or both! Sometimes you need a count of bytes processed, and it can be annoying to calculate bytes processed from the size change of the slice when it could've just been given to you (maybe you also want to process what was decoded successfully, but as bytes :wink:, or maybe you want to display a progress bar)
Last updated: Jul 05 2025 at 12:14 UTC