UTF-8 Foreign Language Characters · beginners

Brian Teague (Feb 02 2024 at 20:28):

What would be the best way to go from 12363 (Int *) back to String or Char? Should 'か' default to List U8 instead of Int *?

x = 'か'
12363 : Int *

Str.fromUtf8 [227, 129, 139]
Ok "か" : Result Str

Brendan Hansknecht (Feb 02 2024 at 20:40):

Brendan Hansknecht (Feb 02 2024 at 20:41):

@Richard Feldman I don't quite understand why we removed Str.appendScalar. It feels like an important primitive for using characters with strings. I don't think it has the same complexity as the other Unicode functions, and it can return a Result to avoid errors with an invalid Unicode scalar.

That, or we at least need some way to go from a character literal to a string literal.

Brendan Hansknecht (Feb 02 2024 at 20:43):

I guess currently the best option would be to store か as a string and use Str.concat
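
A minimal sketch of that workaround, using only Str.concat and the Str.fromUtf8 call from the first message. The names `ka`, `aka`, and `fromBytes` are made up for illustration, and the syntax assumes the Roc version current at the time of this thread:

```roc
# Store the character as a Str instead of as the code point 12363.
ka = "か"

# Appending is then ordinary string concatenation.
aka = Str.concat "あ" ka
# aka is "あか"

# If the character arrives as UTF-8 bytes, decode first and handle the error case.
fromBytes =
    when Str.fromUtf8 [227, 129, 139] is
        Ok decoded -> Str.concat "あ" decoded
        Err _ -> "あ"
```

Nothing here touches code point integers, so the character literal 'か' is not needed for this use.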

Brendan Hansknecht (Feb 02 2024 at 20:45):

Brian Teague (Feb 02 2024 at 21:02):

Nothing actually productive. I'm just trying to learn the specifics of Roc's implementation. I mean, the easiest thing to do is Str.toUtf8 "か", so maybe a better question is: why not treat everything as a Str instead of an Int, unless there is a specific use case for single-character integers?

Brendan Hansknecht (Feb 02 2024 at 21:21):

I guess the larger ones are problematic. The U8 ones are great for pattern matching.

Before, it was useful because you could convert a string into a list of scalars (U32) and then match on any of these values.

Luke Boswell (Feb 02 2024 at 21:55):

We just need to add it in roc-lang/unicode. The module is there; it just needs some love.

Richard Feldman (Feb 02 2024 at 22:20):

part of the motivation for removing it is to make it more obvious that in practical scenarios, you should either be working in terms of Str or in terms of List U8 99.99% of the time, and doing anything at all with code point integers should be microscopically rare in practice

Richard Feldman (Feb 02 2024 at 22:22):

(other than the ones that overlap with ASCII, which comes up in parsing textual data formats like JSON and source code, in which case List U8 is definitely the right thing to reach for!)
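
As an illustration of the List U8 approach for ASCII-range parsing, here is a small sketch. `firstToken` is a hypothetical helper, not something from the standard library, and the lambda syntax assumes the Roc version current at the time of this thread:

```roc
# Hypothetical helper: classify JSON-like input by its first byte.
# The single-quote literals are plain integers, used here as U8 patterns.
firstToken : List U8 -> Str
firstToken = \bytes ->
    when bytes is
        ['{', ..] -> "object"
        ['[', ..] -> "array"
        ['"', ..] -> "string"
        [] -> "empty"
        _ -> "other"
```

It would be called on the raw bytes of the input, for example `firstToken (Str.toUtf8 "[1, 2]")`.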

Richard Feldman (Feb 02 2024 at 22:23):

someone pointed out in a comment somewhere (reddit I think?) that they weren't sure what Str was encouraging them to do in terms of these different primitives, and I think that criticism was valid

Richard Feldman (Feb 02 2024 at 22:24):

so I think there's value in not having any Str functions at all that work in terms of code points, and instead having all of that logic live in roc-lang/unicode

Brendan Hansknecht (Feb 02 2024 at 22:47):

Brendan Hansknecht (Feb 02 2024 at 22:49):

I don't think that is the issue. Unicode is a power module for special use cases. Most users should never need to touch it. Adding a character to the end of a Str is not a special use case. We need to make sure there is a clear story of how that works.

Note, the clear story may be to remove literals like 'か' and require "か" instead. Then Str.concat just works.

Luke Boswell (Feb 02 2024 at 23:23):

I would use Str.concat mystr "か". It is helpful to have the code point literals though, so I would rather not lose them.

Brendan Hansknecht (Feb 02 2024 at 23:24):

I'm not sold. It is really strange to have a literal type that doesn't work with any of the standard library.

Richard Feldman (Feb 03 2024 at 00:02):

Richard Feldman (Feb 03 2024 at 00:03):

Richard Feldman (Feb 03 2024 at 00:04):

(I think it's reasonable to try that and see if there's demand in practice for expanding it; I suspect there would be little or none)

Brendan Hansknecht (Feb 03 2024 at 00:09):

Yeah, I think that would make more sense: 'c' for List U8 and "c" for string use.

Richard Feldman (Feb 03 2024 at 00:23):

Brendan Hansknecht (Feb 03 2024 at 00:25):

Brian Teague (Feb 03 2024 at 02:31):

If I understand you correctly, only convert chars to U8 if they fit, otherwise return a compile error?
An alternative is converting to List U8 if it doesn't fit in a U8, but I could see that leading to unexpected outcomes because of the different types single quote chars could return.
