Stream: beginners

Topic: UTF-8 Foreign Language Characters


view this post on Zulip Brian Teague (Feb 02 2024 at 20:28):

I was playing around with single character type.

What would be the best way to go from 12363 (Int *) back to a Str or a char? Should 'か' default to List U8 instead of Int *?

x = 'か'
12363 : Int *

Str.fromUtf8 [227, 129, 139]
Ok "か" : Result Str
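
The two results above are two views of the same character: 12363 is the Unicode code point of か (U+304B), while [227, 129, 139] is its UTF-8 encoding. A minimal sketch of the round trip, written in Python purely for illustration (this is not Roc, and the helper names are hypothetical):

```python
def code_point_to_utf8(code_point: int) -> list:
    """Encode a single Unicode code point as a list of UTF-8 byte values."""
    return list(chr(code_point).encode("utf-8"))


def utf8_to_str(data: list) -> str:
    """Decode a list of UTF-8 byte values back into a string."""
    return bytes(data).decode("utf-8")


print(ord("か"))                        # code point of the character
print(code_point_to_utf8(12363))        # its UTF-8 bytes
print(utf8_to_str([227, 129, 139]))     # back to the string
```

Going from the integer back to a string therefore needs an encoding step, which is the piece the thread notes is missing from the Str module.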

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 20:40):

Ah, cause we don't have Str.appendScalar anymore.

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 20:41):

@Richard Feldman I don't quite understand why we removed Str.appendScalar. It feels like an important primitive for using characters with strings. I don't think it falls into the same complexity as other Unicode functions, and it can return a Result to avoid errors with an invalid Unicode scalar.

That, or we at least need some way to go from a character literal to a string literal.
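
As a rough illustration of what "return a result" could mean here, the sketch below appends a scalar only when it is a valid Unicode scalar value (0 to 0x10FFFF, excluding the surrogate range 0xD800 to 0xDFFF). It is written in Python, not Roc; `append_scalar` and `is_valid_scalar` are hypothetical names, and `None` stands in for an error result.

```python
from typing import Optional


def is_valid_scalar(n: int) -> bool:
    """True if n is a Unicode scalar value (any code point except a surrogate)."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)


def append_scalar(s: str, scalar: int) -> Optional[str]:
    """Return s with the scalar appended, or None if the scalar is invalid."""
    if not is_valid_scalar(scalar):
        return None
    return s + chr(scalar)


print(append_scalar("ひら", 12363))   # valid scalar: appended
print(append_scalar("abc", 0xD800))   # surrogate: rejected
```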

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 20:43):

I guess currently the best option would be to store as a string and use Str.concat

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 20:45):

@Brian Teague what are you actually trying to do?

view this post on Zulip Brian Teague (Feb 02 2024 at 21:02):

Nothing actually productive. I'm just trying to learn the specifics of Roc's implementation. The easiest thing to do is Str.toUtf8 "か", so maybe a better question is: why not treat everything as a Str instead of an Int, unless there is a specific use case for single-character integers?

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 21:21):

I guess the larger ones are problematic. The U8 ones are great for pattern matching.

Before, it was useful cause you could convert a string into a list of scalars (U32) and then match on any of these values.

view this post on Zulip Luke Boswell (Feb 02 2024 at 21:55):

We just need to add it in roc-lang/unicode. The module is there; it just needs some love.

view this post on Zulip Richard Feldman (Feb 02 2024 at 22:20):

part of the motivation for removing it is to make it more obvious that in practical scenarios, you should either be working in terms of Str or in terms of List U8 99.99% of the time, and doing anything at all with code point integers should be microscopically rare in practice

view this post on Zulip Richard Feldman (Feb 02 2024 at 22:22):

(other than the ones that overlap with ASCII, which comes up in parsing textual data formats like JSON and source code, in which case List U8 is definitely the right thing to reach for!)
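
One reason matching on ASCII bytes works well for parsing is that, in UTF-8, every byte of a multi-byte character is 128 or above, so an ASCII byte value can never appear inside the encoding of a non-ASCII character. A small sketch in Python (not Roc; `structural_offsets` is a hypothetical helper, and it is a plain byte scan that ignores string context, not a real JSON parser):

```python
def structural_offsets(data: bytes) -> list:
    """Return (offset, char) pairs for JSON structural bytes found in data."""
    structural = {ord(c) for c in "{}[]:,"}
    return [(i, chr(b)) for i, b in enumerate(data) if b in structural]


doc = '{"k":"か"}'.encode("utf-8")

# The three bytes of か are all >= 0x80, so they cannot be mistaken for ASCII.
print(all(b >= 0x80 for b in "か".encode("utf-8")))
print(structural_offsets(doc))
```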

view this post on Zulip Richard Feldman (Feb 02 2024 at 22:23):

someone pointed out in a comment somewhere (reddit I think?) that they weren't sure what Str was encouraging them to do in terms of these different primitives, and I think that criticism was valid

view this post on Zulip Richard Feldman (Feb 02 2024 at 22:24):

so I think there's value in not having any Str functions at all that work in terms of code points, and instead having all of that logic live in roc-lang/unicode

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 22:47):

I guess it's just really weird having the 'か' literal then.

It can't be used with Str or List U8

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 22:49):

Luke Boswell said:

We just need to add it in roc-lang/unicode

I don't think that is the issue. Unicode is a power module for special use cases. Most users should never need to touch it. Adding a character to the end of a Str is not a special use case. We need to make sure there is a clear story of how that works.

Note, the clear story may be to remove literals like 'か' and require "か" instead. Then Str.concat just works.

view this post on Zulip Luke Boswell (Feb 02 2024 at 23:23):

I would use Str.concat mystr "か". It is helpful to have the codepoint literals though, so I would rather not lose that

view this post on Zulip Brendan Hansknecht (Feb 02 2024 at 23:24):

I'm not sold. It is really strange to have a literal type that doesn't work with any of the standard library.

view this post on Zulip Richard Feldman (Feb 03 2024 at 00:02):

I agree that it's strange

view this post on Zulip Richard Feldman (Feb 03 2024 at 00:03):

so then would single quotes only accept things that fit in U8?

view this post on Zulip Richard Feldman (Feb 03 2024 at 00:04):

(I think it's reasonable to try that and see if there's demand in practice for expanding it; I suspect there would be little or none)

view this post on Zulip Brendan Hansknecht (Feb 03 2024 at 00:09):

Yeah, I think that would make more sense: 'c' for List U8 and "c" for string use.

view this post on Zulip Richard Feldman (Feb 03 2024 at 00:23):

oh I meant 'c' for U8

view this post on Zulip Richard Feldman (Feb 03 2024 at 00:23):

(maybe that's what you meant too though!)

view this post on Zulip Brendan Hansknecht (Feb 03 2024 at 00:25):

Yeah, sorry, I meant 'c' for use with List U8. The value would just be a U8.

view this post on Zulip Brian Teague (Feb 03 2024 at 02:31):

Richard Feldman said:

so then would single quotes only accept things that fit in U8?

If I understand you correctly, only convert chars to U8 if they fit, otherwise return a compile error?
An alternative is converting to List U8 if it doesn't fit in a U8, but I could see that leading to unexpected outcomes because of the different types single quote chars could return.

view this post on Zulip Brendan Hansknecht (Feb 03 2024 at 03:05):

Yeah, exactly that

view this post on Zulip Brendan Hansknecht (Feb 03 2024 at 03:05):

With the exact same concern for different values
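
The rule discussed above (a single-quote literal is a U8 if it fits, and an error otherwise) can be sketched as follows. This is Python, not Roc, and `single_quote_literal` is a hypothetical name; a raised `ValueError` stands in for the compile error. It assumes "fits in U8" means a code point of at most 255; a stricter reading would allow only ASCII (at most 127).

```python
def single_quote_literal(ch: str) -> int:
    """Return the U8 value of a one-character literal, or raise if it does not fit."""
    if len(ch) != 1:
        raise ValueError("a single-quote literal must contain exactly one character")
    code_point = ord(ch)
    if code_point > 0xFF:
        # Assumption: anything above 255 is rejected rather than widened
        # or turned into a List U8, so the literal always has one type.
        raise ValueError(f"{ch!r} (code point {code_point}) does not fit in U8")
    return code_point


print(single_quote_literal("c"))
try:
    single_quote_literal("か")
except ValueError as err:
    print("rejected:", err)
```

Rejecting the literal keeps its type the same in every case, which avoids the concern raised above about single quotes sometimes producing a U8 and sometimes a List U8.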


Last updated: Jul 06 2025 at 12:14 UTC