Stream: beginners

Topic: Output formatting


view this post on Zulip Asier Elorz (he/him) (Dec 08 2022 at 09:29):

What is the most ergonomic way to format several objects of different types into a string for printing? Something like the print! and format! macros in Rust, Python's str.format and f-strings, C++'s fmt library or std::format...

In Roc I end up writing a lot of code to just print a bunch of formatted integers:

diskSizeStr = Num.toStr diskSize
totalUsedSizeStr = Num.toStr totalUsedSize
freeSpaceStr = Num.toStr freeSpace
spaceToFreeStr = Num.toStr spaceToFree
sizeOfDirToDeleteStr = Num.toStr sizeOfDirToDelete

chain
    [ Stdout.line "Disk size:             \(diskSizeStr)"
    , Stdout.line "Total used size:       \(totalUsedSizeStr)"
    , Stdout.line "Free space:            \(freeSpaceStr)"
    , Stdout.line "Space to free:         \(spaceToFreeStr)"
    , Stdout.line "Size of dir to delete: \(sizeOfDirToDeleteStr)"
    ]

My main frustration is that string interpolation only takes arguments of type string and does not accept inline function calls, which means I have to create a constant for every value I want to print. Is there a more ergonomic way of converting to string and printing output?

view this post on Zulip Brendan Hansknecht (Dec 08 2022 at 16:53):

I think the answer is currently no, sadly.

view this post on Zulip Brendan Hansknecht (Dec 08 2022 at 16:54):

Long term we want to add some sort of debug and display abilities. They would hopefully be auto derived.

view this post on Zulip Brendan Hansknecht (Dec 08 2022 at 16:55):

Currently the only way to get a print out without conversion is to use dbg, but it is very new and has a number of issues.

view this post on Zulip Kevin Gillette (Dec 08 2022 at 18:10):

In the meantime, you could perhaps use a list of message-and-number records, which you map into Stdout.line tasks.

view this post on Zulip Kevin Gillette (Dec 08 2022 at 18:14):

@Brendan Hansknecht I was going to open a thread about this as well. In the short and medium term, would it be alright if we special cased interpolation to accept numbers as well, presuming abilities would produce the same output (i.e. special cased implicit conversion using Num.toStr, which a later ability would presumably be implemented in terms of)?

If that sounds reasonable, I would be open to working on this.

view this post on Zulip Brendan Hansknecht (Dec 08 2022 at 18:26):

I think that would be pretty awesome. Though it may be best to just implement display directly; I think we have all of the prerequisite pieces at this point.

Aside: we have encode that can encode things as json for example, so theoretically that could also be used to convert a large record into something printable. Doesn't move everything to one line, but would enable merging many variables into one printable thing.

view this post on Zulip Kevin Gillette (Dec 09 2022 at 04:41):

Doesn't move everything to one line

@Brendan Hansknecht can you clarify? Does encode currently produce multi-line output? [by default?]

I would imagine that a display ability would produce concise Roc literal syntax, i.e. output that is mostly value-unambiguous (though probably ignoring the difference between U8 and I128 by outputting 5 instead of 5u8) while also being familiar and readable to Roc developers? This seems to be a pretty modern, increasingly typical tradeoff: if you care about the specific formatting, you opt into a specific formatting in order to control the output.

The trick comes with strings like "5" (which display the same as 5), versus {a: 1, b: "hello"}, which presumably display as either literal syntax or {a: 1, b: hello}. I'm expecting that a record, list, or any other data structures (tags?) composed solely of display-able types could themselves automatically derive the display ability?

For cases where _guaranteed_ non-ambiguity is needed, we may want another ability. Consider {a: "1", b: "hello"}, which could perhaps be displayed as {a: "1", b: hello}, that is, rendering a as a quoted string because its contents could be misinterpreted as belonging to an entirely different type family (whereas "hello" can't be misinterpreted for anything other than a string in Roc).

view this post on Zulip Brendan Hansknecht (Dec 09 2022 at 04:47):

Oh, I meant when you write code it still would be 2 lines if you want extra context. You can't use it directly in string interpolation.

varStr = Encode.encode var
Stdout.line "My Thing: \(varStr)"

With display, we will theoretically make string interpolation use it by default. so you would be able to just write

Stdout.line "My Thing: \(var)"

view this post on Zulip Brendan Hansknecht (Dec 09 2022 at 04:51):

Hmm. That's a really good point on strings. I guess this is why we may need both a Display ability and a Debug ability: one for pretty printing and another for debugging. Of course, by making an opaque type, you can always override either.

view this post on Zulip Brendan Hansknecht (Dec 09 2022 at 04:52):

I assume @Richard Feldman or someone else has maybe put a bit more thought into the api we want to support.

view this post on Zulip Brendan Hansknecht (Dec 09 2022 at 04:52):

but otherwise, I guess we just need to discuss the semantics of the api some.

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:24):

yeah so I think there are two different use cases:

  1. Stringifying things for people using the software
  2. Stringifying things for people writing the software (e.g. putting strings in quotation marks)

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:25):

I think every Roc type should automatically infer some implementation of the second thing - we'd previously discussed naming the ability Inspect - and I think dbg and expect should use that implementation to stringify values, so you can customize how they look in opaque types

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:27):

e.g. if you made a custom data structure with its own custom Eq, and then you write a test that compares two of them, if that test fails it's not going to be super helpful to get a dump of the entire internal structure of the data structure on the screen - compared to (for example) having it shown as MyTree.fromList [1, 2, 3] or something like that (which the author of the data structure could achieve by implementing a custom Inspect for that opaque type, which would replace the default Inspect implementation that just returns "<opaque>" so as not to leak internal implementation details by default)

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:29):

stringifying things for people using the software is a different use case; let's say we call it Display (I don't love that name because I think we'll want an ability in the future for rendering types in the editor - basically Inspect except instead of returning a Str it returns UI elements, and I think it would be strange to have an ability called Display and another called Render, and the first of those returns a Str...but still, I don't know of a name I like better, so let's call it Display for now since that's what Rust calls it)

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:30):

I think Display should only be auto-inferred for strings and numbers, because I actively want a type mismatch if I accidentally try to render something for an end user that's really an internal-only type

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:32):

for example, I don't think records should get Display because if I'm rendering a record directly to an end user, I'm presumably making a mistake - why would I want to display Roc code to my user? If it's because they're a programmer, fair enough - but in that unusual case I can use Inspect instead

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:33):

we actually had a bug at work once because Elm had a toString function that accepted any value, and we accidentally passed it a function and it ended up silently doing the wrong thing

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:34):

so I think Display should be automatically implemented for strings and numbers, and that's it by default - but if you want to implement it for your own custom opaque type, you can

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:36):

for example, in the case of Path I want to do that; rendering paths to users is very normal, but under the hood paths are allowed to contain unprintable Unicode characters (because UNIX allows those characters in its filenames), so when displaying it to an end user, the idea would be to convert any invalid characters to the Unicode replacement character the way Rust does

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:38):

I don't think we should try to hack in a "string interpolation accepts either strings or numbers for now" feature, because there's actually a deep type inference rabbit hole there; e.g. if someone puts this into the repl, you have to decide what it prints for the type:

\x -> "could be a string or number: \(x)"

view this post on Zulip Richard Feldman (Dec 09 2022 at 11:39):

so I'd rather we just directly spent the effort on making Display and making string interpolation accept something that has Display instead of a Str value

view this post on Zulip Kevin Gillette (Dec 09 2022 at 19:59):

All of that sounds quite reasonable to me, except, perhaps, regarding Path:

for example, in the case of Path I want to do that; rendering paths to users is very normal, but under the hood paths are allowed to contain unprintable Unicode characters (because UNIX allows those characters in its filenames), so when displaying it to an end user, the idea would be to convert any invalid characters to the Unicode replacement character the way Rust does

Wouldn't the right user experience be to convey to the user something that's unambiguous _and_ usable for all valid paths?

Except on systems which interpret paths as Unicode, we shouldn't interpret them as Unicode either, unless we're quite confident that applications or OS handling will transparently re-encode UTF-8 into latin-1 or whatever via collation rules. Even then, we should only do so when there's a reversible transform available.

U+FFFD replacement is not a reversible transform. We could encode such non-Unicode bytes with \xFF-style escapes (or an equivalent): the user might not be able to copy-paste the result, but it is unambiguous, it hopefully won't come up often, and at least sufficiently technical users would be able to transform it into whatever they need.

Aside: Go has pretty clever (and frankly, simple) strings: a string is just an immutable array of bytes. Most stdlib APIs will treat it as a UTF-8 encoded byte sequence (and thus inject replacement characters and such), and default iteration over a string will yield a byte offset paired with a rune (codepoint).

However, you can treat any string as binary merely by walking over it with byte-by-byte iteration (for pure ASCII strings this has essentially the same effect as regular iteration). You can mix and match \u and \U escapes alongside \x (hex) and \o (octal) escapes, and your hex escapes can of course produce either valid or invalid UTF-8 sequences. Literal, unescaped characters are treated as UTF-8, and the compiler rejects any regular string literal that produces invalid UTF-8 except via escape.
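For concreteness, the two iteration styles described above look like this in standard Go (a minimal sketch; the string value is made up):

```go
package main

import "fmt"

func main() {
	s := "héllo" // 'é' is a two-byte UTF-8 sequence

	// Default range iteration decodes UTF-8: it yields a byte
	// offset paired with a rune (codepoint).
	for i, r := range s {
		fmt.Printf("%d: %c\n", i, r)
	}

	// Byte-by-byte indexing treats the same string as binary; for
	// pure ASCII this walks the same positions as the range form.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%d: 0x%02x\n", i, s[i])
	}
}
```

Note how the range loop skips offset 2: the 'é' occupies bytes 1 and 2, so the next rune starts at offset 3.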

In other words, you have full control, yet it's designed to make people fall into the pit of success of treating things as Unicode by default. I know people who've used Go for years without ever having to consume binary-containing strings; they didn't know Go strings could do that, yet they've never been burned by the capability (or had occasion to notice it) either. I have experienced or fathomed zero practical downsides to Go's approach of not getting in your way while defaulting to Unicode.

In contrast, I've been using Python frequently for a decade and a half, and both 2.x and 3.x have separate binary and textual string types, and there's nothing at all that's straightforward or enjoyable about needing to deal with a mix of binary and Unicode in that language (or converting between them). When it comes to binary, Python is a draconian enforcer that will chase you forever and haunt your dreams, even in cases where there are entirely sensible defaults, and 3.x is often considered to have more annoying behavior than 2.x.

As such, I retain an open mind of course, but I've generally found languages that make a formal distinction between Unicode and binary strings to have been overly preoccupied with hypothetical risks (or a desire for precise and rigid type taxonomies) at the expense of practicality, ease of use, and trust in the competency of their programmers.

view this post on Zulip Richard Feldman (Dec 09 2022 at 20:58):

Wouldn't the right user experience be to convey to the user something that's unambiguous _and_ usable for all valid paths?

I think it's important to think about what the actual use cases are for displaying a path that might contain invalid characters.

A friend of mine mentioned that he used to have a large collection of old Japanese audio files where the filenames were encoded in an encoding that predated Unicode, and since they contained invalid Unicode, all sorts of programs would just blow up when trying to interact with them.

to me, the best user experience here is:

I don't think it's better to render escapes, because that would degrade the experience for valid filenames; they would need to be double-escaped. For example, valid UNIX paths can contain any byte except 0. So a UNIX path with \ is valid. If we introduce \ as an escape character, then to remain unambiguous, we have to replace any \s we encounter in paths with \\ to show that this was actually originally an \. But now we're taking a path that used to be valid and displaying it as a different path that's also valid from a Unicode perspective, but it's the incorrect path if they copy/paste it.

Given that all of these only come up in unusual edge cases, optimizing for them seems hard to justify, and escaping seems like a more error-prone user experience than showing replacement characters!

view this post on Zulip Richard Feldman (Dec 09 2022 at 21:03):

default iteration over a string will yield a byte offset paired with a rune (codepoint). [...] I have experienced or fathomed zero practical downsides to Go's approach of not getting in your way yet defaulting to Unicode.

what's your experience using this API with extended grapheme clusters?

view this post on Zulip Kevin Gillette (Dec 10 2022 at 03:34):

I've wondered a bit about optimistic rendering, i.e. nominally display a string with no special modifications, but if any part of it is problematic, escalate to a descriptive rendering.

I've used this in the equivalent of a concise Debug formatting, specific to applications I've worked on, in particular allowing strings to be unquoted when they contain only safe characters but quoted otherwise.

Less drastically, Python's repr of a string will produce single-quoted output unless the input contains single quotes, in which case it'll produce double-quoted output (unless the input _also_ contains double quotes, in which case it'll produce single-quoted output with escaped internal quotes).

Go can be asked to format a string into a raw string literal (with the %#q Printf directive), but it'll produce a double quoted string if the input contains anything that can't be represented in a raw string literal (i.e. a backquote).

You had mentioned an aversion to variable syntax output, which I believe is the right choice at the language/stdlib level.

view this post on Zulip Kevin Gillette (Dec 10 2022 at 03:38):

what's your experience using this API with extended grapheme clusters?

@Richard Feldman Just so we're clear, can you reply with a string you have in mind, perhaps also in an escaped Roc literal form (in case any software does normalization or transforms), as well as a brief description of operations you'd want to see/evaluate in practice?

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:03):

sure: "👩‍👩‍👦‍👦"

» "👩‍👩‍👦‍👦" |> Str.toScalars

[128105, 8205, 128105, 8205, 128102, 8205, 128102] : List U32

» "👩‍👩‍👦‍👦" |> Str.graphemes
["👩‍", "👩‍", "👦‍", "👦"] : List Str

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:15):

so this is:

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:21):

ha, actually this is a bug! This should be 1 extended grapheme cluster

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:21):

(I just double checked in Swift, which is what the design is based on)

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:22):

I'll open an issue for that

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:23):

anyway, this ended up being a good example of how easy this stuff is to get wrong in the presence of things like emojis which contain multiple code points

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:26):

I think if an application author is using code points in any way, it is overwhelmingly likely that their code has bugs in the presence of emojis

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:26):

code points are essentially never the right thing to reach for unless you are very specifically writing a low level library - like a Unicode tool or glyph renderer - and I think they should be as buried as possible in the standard library (and maybe not even exposed at all, honestly)

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:29):

and I think slicing into arbitrary byte indices is locally better for performance within the context of that one operation, but globally worse because all sorts of other operations can no longer safely assume they're receiving valid UTF-8

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:30):

which means they either need to defensively code around it or else give up memory safety (which is out of the question in Roc)

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:32):

given that, I'm not sold on any part of Go's design - the "it's just a sequence of bytes with no guarantees" means UTF-8 operations on them must all verify defensively, slowing them down more than I would expect direct byte indexing to make up for, and the "expose code points" is just asking for emoji bugs because code points are almost never the right unit to look at

view this post on Zulip Richard Feldman (Dec 10 2022 at 04:34):

incidentally the way I found out about extended grapheme clusters was that we had a big fire at work due to users inputting emojis that we weren't processing correctly because the original authors of a particular code path didn't know about these distinctions, and Ruby didn't offer a "pit of success" (which I think Swift does, and which I'm trying to make Roc's API do too!)

view this post on Zulip Kevin Gillette (Dec 10 2022 at 04:56):

https://go.dev/play/p/_WX_tMCTFmD

Indeed I see what you mean about iterating codepoints/scalars, and I agree with that. I was intending to convey my preference for the unicode/binary duality within a single type (using whatever the best interpretation of each aspect is), not specifically the use of code points.

view this post on Zulip Kevin Gillette (Dec 10 2022 at 04:59):

Unicode example where the graphemes happen to also be single codepoints: https://go.dev/play/p/otKTPn27DZy
Go's apparent extended-cluster non-awareness doesn't have an effect here.

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:06):

Binary data example (a 1x1 PNG): https://go.dev/play/p/MAdhG0jolZg

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:09):

Richard Feldman said:

code points are essentially never the right thing to reach for unless you are very specifically writing a low level library - like a Unicode tool or glyph renderer - and I think they should be as buried as possible in the standard library (and maybe not even exposed at all, honestly)

To make those third party libraries capable of interoperating with regular Roc strings, I'd imagine you'd still at least need to expose the Str.toUtf8 function? Otherwise, there's little opportunity for such a library to do low level processing.

view this post on Zulip Richard Feldman (Dec 10 2022 at 05:10):

yeah totally!

view this post on Zulip Richard Feldman (Dec 10 2022 at 05:11):

I think that's necessary for several reasons, e.g. to write strings to disk or to the network

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:15):

Richard Feldman said:

which means they either need to defensively code around it or else give up memory safety (which is out of the question in Roc)

Please elaborate on how this would be a memory safety concern? Roc strings are sized, and granting direct byte access (let's say with a hypothetical Str.getByte that behaves equivalently to List.get) would still need to obey that size. Sure, an incomplete low-level decoder might have unforeseen behavior if encountering a partial utf8 codepoint byte sequence (or an incomplete grapheme cluster sequence, for that matter), but from a language perspective, that behavior is defined: it can't peek into any memory that doesn't belong to the passed string.

view this post on Zulip Richard Feldman (Dec 10 2022 at 05:20):

some UTF-8 bytes indicate that "I am a multi-byte code point, so to determine the actual code point, you'll need to read the next byte"

view this post on Zulip Richard Feldman (Dec 10 2022 at 05:21):

if you end a "string" with one of those, then in the absence of defensive checking, something that's iterating and encounters one will read past the end of the string's memory and potentially segfault etc

view this post on Zulip Richard Feldman (Dec 10 2022 at 05:22):

defensive checking would mean verifying on every single multibyte code point that you're not about to overshoot the end of the string

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:27):

Richard Feldman said:

given that, I'm not sold on any part of Go's design - the "it's just a sequence of bytes with no guarantees" means UTF-8 operations on them must all verify defensively, slowing them down more than I would expect direct byte indexing to make up for, and the "expose code points" is just asking for emoji bugs because code points are almost never the right unit to look at

You only need to be as defensive as the situation demands. If you want to process a Go string as bytes, it probably means that either:

  1. You very deliberately are processing data as binary, in which case there are no safety/correctness concerns by definition.
  2. You very deliberately are treating the string as ASCII (and ignoring anything with a value greater than 0x7f).

The pit of success for this in Go is that the iteration style almost exclusively used in the language (high-level range iteration) treats a string as Unicode. This iteration style is more concise, more convenient, and learned first. I believe some review of open source Go code found that 97% of all loops in Go use the range form. It's remarkably rare that people fall into an "oops, I treated this textual string as bytes by accident" trap (actually, I believe more people know about and are comfortable with converting a string into an explicit array of bytes than there are people who know about or remember off-hand the lower-level iteration form). Most other parts of the language likewise treat a string as Unicode.

If we had a time machine, and could make Go better handle extended grapheme clusters (i.e. iterate clusters instead of code points), then in theory it'd have been better, though since Go is garbage collected and grapheme clusters would seem to be variable length, that may have still been a tenuous choice for the language compared to the performance tradeoffs that Roc makes (such as favoring refcounting most of the time and thus handling throwaway arrays more cheaply).

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:49):

Some things I'd like to ideally be able to do without jumping through _too_ many hoops:

  1. Read dubiously-encoded data. Your friend's filename example is one such case. Historical/archival processing is another. Data recovery is a third (where you don't even know where a file semantically starts, or even where a block starts, beyond a 512-byte multiple, assuming you've got a disc/disk rather than a tape archive).
  2. Read data with a known encoding but which may have received transmission or persistence errors.
  3. Be able to loop over Unicode grapheme clusters until U+FFFD is encountered, and then cheaply/conveniently inspect the original bytes that were malformed, not just the UTF-8 encoding of U+FFFD.

These are certainly doable in Roc (and Roc would certainly do a better job of it than Python), but I'm guessing number 3 would probably take some work, i.e. maintaining parallel copies of the original data as List U8 (binary) and List Str (extended grapheme clusters), then walking the list of clusters until U+FFFD is encountered, doing some countUtf8Bytes arithmetic at each step along the way. When the replacement character is encountered, you'd then need to do some heuristics/detection on the List U8 to match it back up to subsequent clean cluster bytes to isolate the problem byte(s), while also needing to have a fair bit of UTF-8 encoding and unicode knowledge.

I could also see the above being done with Str.fromUtf8Range, though that can waste a fair bit of compute by optimistically setting a 4-byte stride or pessimistically setting a 1-byte stride; it does have the advantage of not needing a duplicate of what could be very large data.

In Go, you get #3 essentially for free: iteration gives you byte offsets as well as the decoded semantic data, so when you see a U+FFFD, you mark the start offset and keep iterating until you see something other than U+FFFD, mark the trailing offset, and then you have your broken byte range, all without the programmer really needing more than a cursory understanding of UTF-8, and all without needing to make guesses or heuristics (and without any need for a copy of the data).
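A sketch of that offset-marking technique in Go (the input string is made up; note that a genuine U+FFFD already present in the input would need an extra byte comparison to disambiguate):

```go
package main

import "fmt"

func main() {
	// Two raw invalid bytes embedded in otherwise-valid UTF-8.
	s := "ok\xff\xfe!"

	start, end := -1, -1
	for i, r := range s {
		// range yields U+FFFD for each invalid byte and advances
		// by exactly one byte.
		if r == '\uFFFD' {
			if start < 0 {
				start = i
			}
			end = i + 1
		}
	}
	fmt.Println(start, end) // the broken byte range is s[start:end]
}
```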

view this post on Zulip Kevin Gillette (Dec 10 2022 at 05:54):

Even without a binary-in-strings capability, something like the following would effectively give Roc that forensic power described above...

decodeFirstGraphemeFromUtf8Bytes : List U8 -> Result { grapheme : Str, numBytes : Nat } *

It would just be applied repeatedly, while trimming bytes off the front of the provided list in order to process more input.

view this post on Zulip Richard Feldman (Dec 10 2022 at 06:04):

yeah, and with seamless slices, we could go even further and give you back a slice of the original List U8 instead of a Nat - so you wouldn't have to pay for a bounds check to index into it again!

view this post on Zulip Kevin Gillette (Dec 10 2022 at 06:08):

Richard Feldman said:

if you end a "string" with one of those, then in the absence of defensive checking, something that's iterating and encounters one will read past the end of the string's memory and potentially segfault etc

Yep, I'm familiar with that. My confusion is just how that could happen in Roc, specifically. I wouldn't imagine that Roc would be deliberately made memory-unsafe in any scenario, so it could at most panic if attempting to read/decode past the end of a string, where I'm interpreting "string" as a known/enforced-length list of bytes that is expected to contain well-formed UTF-8. A [non-streaming] decoder that is not aware of the length of its input is arguably not sound or sufficiently robust, whereas a sound decoder should be able to stop, with a semantic error, upon reaching the end of the known-size byte array without attempting to blow past that end merely because it's in the middle of a byte sequence.

Sure, we can rely on memory-safe semantics to save us from a buffer overflow, but that still doesn't mean that such a decoder would be properly designed.

I don't see the choice of enforced UTF-8 vs. optimistic UTF-8 as forcing an existential dilemma for the language. And for that matter, I'm guessing Roc does not scan every byte of every string a platform gives it at the moment it's received just to pre-screen for this potential issue, so already today (and foreseeably in the future, for performance reasons), I'd imagine I could trivially write a platform that gives Roc a 10 GiB binary "string" (I'd merely expect Roc to interpret a lot of replacement chars out of that data).

view this post on Zulip Kevin Gillette (Dec 10 2022 at 06:11):

Richard Feldman said:

yeah, and with seamless slices, we could go even further and give you back a slice of the original List U8 instead of a Nat - so you wouldn't have to pay for a bounds check to index into it again!

Or both! Sometimes you need a count of bytes processed, and it can be annoying to calculate it from the size change of the slice when it could've just been given to you (maybe you also want to process what was decoded successfully, but as bytes :wink:, or maybe you want to display a progress bar)


Last updated: Jul 05 2025 at 12:14 UTC