How to get String length? · beginners

Abhinav Sarkar (Apr 23 2024 at 10:05):

This is an extreme beginner's question, but I can't figure out how to get the length of a string in Roc. Can someone help, please?

Anton (Apr 23 2024 at 10:14):

Hi @Abhinav Sarkar,
If you only expect your Str to contain ASCII characters, you can use List.len (Str.toUtf8 myStr). For an in-depth treatment of this surprisingly complicated problem, I recommend reading the section Unicode, Graphemes, Code Points... near the top of the Str docs.
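To illustrate why the byte-count approach is only reliable for ASCII, here is a minimal sketch. It is written in Python rather than Roc purely for illustration; the Roc counterpart of the byte count is the List.len (Str.toUtf8 myStr) expression above. The function name and sample strings are hypothetical.

```python
def utf8_byte_len(s: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


ascii_text = "hello"
accented = "h\u00e9llo"  # U+00E9 (e with acute) is one code point but two UTF-8 bytes

print(utf8_byte_len(ascii_text))  # 5, same as the number of characters
print(utf8_byte_len(accented))    # 6, one more than the 5 code points
```

For pure ASCII input the byte count and the character count agree, which is why the shortcut is safe only under that assumption.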

Abhinav Sarkar (Apr 23 2024 at 10:15):

I'd like to handle Unicode too. I read the docs. It recommends using the roc-unicode package, but it does not have any releases.

Abhinav Sarkar (Apr 23 2024 at 10:16):

And I don't know how to use a package from GitHub without releases. I can't find any docs on that.

Anton (Apr 23 2024 at 10:42):

Do we have something like countGraphemes in roc-unicode that works @Luke Boswell?
Can you tell us a bit more about why you need the length of a string @Abhinav Sarkar? That may allow me to give some specific suggestions.

Abhinav Sarkar (Apr 23 2024 at 10:44):

I'm writing a pretty printing algorithm and it requires me to know how many columns a string takes when printed. If I understand correctly, that's the grapheme count of the string.

Luke Boswell (Apr 23 2024 at 10:48):

Yeah, Grapheme.split from roc-lang/unicode can be used to do this with unicode.

Luke Boswell (Apr 23 2024 at 10:49):

We don't have a release so you currently have to clone that repo (it's tiny) and use a local reference

Anton (Apr 23 2024 at 10:50):

Abhinav Sarkar (Apr 23 2024 at 10:51):

Luke Boswell (Apr 23 2024 at 10:51):

Luke Boswell (Apr 23 2024 at 10:52):

Richard Feldman (Apr 23 2024 at 11:07):

I think for a pretty printing "column" count, grapheme count makes the most sense

Richard Feldman (Apr 23 2024 at 11:09):

if you're in a monospace font, 1 grapheme will more often correspond to 1 "character width" than anything else will (including code points), although it's still not guaranteed
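A small sketch of the difference between code points and graphemes, in Python for illustration only. The helper below is a crude approximation that simply skips combining marks; it is not real grapheme segmentation (it ignores ZWJ emoji sequences, regional indicators, Hangul jamo and more), and the function name is hypothetical.

```python
import unicodedata


def approx_grapheme_count(s: str) -> int:
    """Count code points that are not combining marks.

    This only approximates grapheme clusters for simple cases such as
    a base letter followed by combining accents.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(len(decomposed))                    # 2 code points
print(approx_grapheme_count(decomposed))  # 1 visible character
```

A string that renders as a single character can contain several code points, which is why the code point count tends to overestimate column width.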

Richard Feldman (Apr 23 2024 at 11:10):

Abhinav Sarkar (Apr 23 2024 at 11:10):

I'd like to be as accurate as possible. Any suggestion for dealing with such cases?

Richard Feldman (Apr 23 2024 at 11:12):

I actually just tried this out (I asked ChatGPT for some concrete examples of them) and they all seem to render at the same width in my terminal:

Richard Feldman (Apr 23 2024 at 11:12):

Abhinav Sarkar (Apr 23 2024 at 11:12):

Richard Feldman (Apr 23 2024 at 11:13):

if you wanted to be maximally accurate, I think you'd need to get specific fonts involved and measure their glyph widths, which probably doesn't make sense for a pretty printer :big_smile:

Richard Feldman (Apr 23 2024 at 11:13):

I can't think of a measurement that's font-agnostic and would work better than grapheme count

Abhinav Sarkar (Apr 23 2024 at 11:16):

I'd rather not deal with fonts. I'll assume monospace. Let me figure out how to get the grapheme count.

Abhinav Sarkar (Apr 23 2024 at 11:16):

Also, some feedback as a noob: it seems really wrong to have to jump through all these hoops to get a string's length.

Hristo (Apr 23 2024 at 11:40):

I could be misunderstanding but I think the "issue" here isn't a Roc-specific one. It might appear Roc is asking you to go through extra hoops, but that's probably because other languages in some cases gloss over things a bit too much, and make assumptions on behalf of the user.

Whether this is an indication that there could exist a tiny bit more user-friendly library in user space - probably, yes, but it'll come with caveats, as there aren't any shortcuts which could be taken without compromise in this case.

Abhinav Sarkar (Apr 23 2024 at 11:43):

Sure, but in my life I've never seen a single language that has a String data type and does not have a length function in it. TBH, I was very surprised with Roc's choice. Just my 2c.

Richard Feldman (Apr 23 2024 at 11:48):

Richard Feldman (Apr 23 2024 at 11:52):

this flow will be smoother when we have a release of roc-unicode, to be fair :sweat_smile:

Luke Boswell (Apr 23 2024 at 11:53):

I've never seen a language like it either. I was surprised too. But I'm finding there are lots of little details like this where a lot of care has gone into thinking about how to set people up for success. It's why I love using roc so much.

I'd love to improve the unicode library. If anyone is interested in helping with that then please let me know. A second pair of eyes to help make things more sensible would be amazing.

Hristo (Apr 23 2024 at 11:54):

@Abhinav Sarkar, my understanding is that one of the philosophies behind the design of Roc is to minimise (or eliminate altogether) the number of ways a developer could shoot themselves in the foot.
One of the directions for ensuring this is via not allowing use-case ambiguities and misinterpretation of semantics to be mistaken to be shortcomings of the language.
The language enables the user to truly attempt to identify and understand their use-case, and act accordingly.

Hristo (Apr 23 2024 at 11:57):

If there is insufficient clarity or understanding in a use-case, then the language cannot fill it in on behalf of the user, because at best that would be a guess. That's why the user is empowered to better understand the underlying problem they're trying to solve.

Again, user-space libraries do not have to share the same methodologies as the standard library, and many users might find functionality providing glossing-over of concepts (which in other languages may come as part of the standard library) available at that kind of level.

Abhinav Sarkar (Apr 23 2024 at 13:23):

Luke Boswell (Apr 23 2024 at 21:32):

Nothing is stopping us from making a release today. Though it doesn't cover much of Unicode yet, and hasn't been tested in great detail, so maybe it's not a good idea? I don't know, but I guess a release would indicate a level of maturity, and what is there is still a WIP, though usable.

It took me quite a while to build the text segmentation for graphemes. It would probably be much easier now.

I was planning to do some fuzz testing with it, but keep getting distracted on other projects and work.

Brendan Hansknecht (Apr 24 2024 at 00:27):

Hannes (Apr 24 2024 at 14:27):

The visual width in a terminal is implemented using the wcwidth function; the Python package has an explanation of it here: https://pypi.org/project/wcwidth/

I actually started working on a roc-wcwidth package a while ago for a project, but didn't finish it. I'll dig it up tomorrow and see if I can finish it.

Abhinav Sarkar (Apr 24 2024 at 15:15):

Brendan Hansknecht (Apr 24 2024 at 22:32):

Richard Feldman (Apr 24 2024 at 22:43):

Richard Feldman (Apr 24 2024 at 22:44):

you could always start with graphemes to get close and then switch to using this later, if you don't want to block on it

Luke Boswell (Apr 24 2024 at 22:46):

Luke Boswell (Apr 24 2024 at 22:47):

Given basic CLI will likely have tools built for a terminal, maybe we should expose that as a helper until we can write a pure roc implementation?

Luke Boswell (Apr 24 2024 at 22:50):

Actually, we might have most of the things we need in our unicode package already

Richard Feldman (Apr 24 2024 at 22:52):

Luke Boswell (Apr 24 2024 at 22:54):

From what I can tell, there is just the EastAsianWidth data file, which maps code points to the width property. Everything else is (neutral) and given a width of 1, except a handful of hardcoded cases like em dash.

Luke Boswell (Apr 24 2024 at 22:55):

# The format is two fields separated by a semicolon.
# Field 0: Unicode code point value or range of code point values
# Field 1: East_Asian_Width property, consisting of one of the following values:
#         "A", "F", "H", "N", "Na", "W"
#  - All code points, assigned or unassigned, that are not listed
#      explicitly are given the value "N".

Luke Boswell (Apr 24 2024 at 22:57):

The Unicode Character Database [UCD] assigns to each Unicode character as its default width property one of six values: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (= Not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.
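A rough sketch of how the six property values can be resolved into column widths, in Python for illustration. It relies on the standard library's unicodedata.east_asian_width. Treating Ambiguous as narrow and combining marks as zero-width are assumptions made here, not something the thread settles, and control characters are not handled. The function name is hypothetical.

```python
import unicodedata


def display_width(s: str) -> int:
    """Estimate terminal columns from the East_Asian_Width property."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            # Assumption: combining marks take no column of their own.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # Wide and Fullwidth resolve to "wide"
        else:
            width += 1  # N, Na, H, and (by assumption) A resolve to "narrow"
    return width


print(display_width("abc"))                 # 3
print(display_width("\u65e5\u672c\u8a9e"))  # 6, three Wide characters
```

This is closer to what a wcwidth-style function computes than a plain grapheme count, but it remains an estimate: the actual rendering still depends on the terminal and font.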

Luke Boswell (Apr 24 2024 at 23:02):

Luke Boswell (Apr 24 2024 at 23:12):

We already have a few examples that do this in our package, so this would be quite easy to implement.

If someone would like to have a crack at this, we just need to add that data file to unicode/package/data, then write an InternalEAWGen.roc file that is almost a copy-paste of InternalGBPGen.roc. It would parse the data file and generate a Roc file that maps a code point CP to an East Asian Width property EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide]. Then we implement a corresponding helper that uses this to walk through a List U8 or a Str and sum the widths.

interface InternalGBP
    exposes [GBP, fromCP, isExtend, isZWJ]
    imports [InternalCP.{ CP, toU32, fromU32Unchecked }]

GBP : [CR, Control, Extend, ZWJ, RI, Prepend, SpacingMark, V, T, LF, LVT, LV, L, Other]

isCR : U32 -> Bool
isCR = \u32 -> (u32 == 13)

# etc
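The parsing step described above can be sketched as follows, in Python for illustration rather than as the Roc generator itself. The SAMPLE text is a tiny illustrative excerpt in the two-field format quoted earlier, not the real data file, and the function names are hypothetical.

```python
SAMPLE = """\
# Illustrative excerpt only, not the full EastAsianWidth data file
0020;Na
00A1;A
1100..115F;W
FF01..FF03;F
"""


def parse_eaw(text):
    """Parse 'code point or range;property' lines into (lo, hi, prop) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        cps, prop = (field.strip() for field in line.split(";", 1))
        if ".." in cps:
            lo, hi = cps.split("..")
        else:
            lo = hi = cps
        ranges.append((int(lo, 16), int(hi, 16), prop))
    return ranges


def lookup(ranges, cp):
    """Return the property for a code point; unlisted code points are 'N'."""
    for lo, hi, prop in ranges:
        if lo <= cp <= hi:
            return prop
    return "N"


ranges = parse_eaw(SAMPLE)
print(lookup(ranges, 0x1100))  # W
print(lookup(ranges, 0x0041))  # N, because "A" is not listed in this excerpt
```

A generator would emit these ranges as lookup code in the target language instead of scanning a list at runtime; the linear scan here is only to keep the sketch short.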

timotree (Apr 28 2024 at 17:26):

Ian McLerran (Apr 28 2024 at 18:09):

Richard Feldman (Apr 28 2024 at 18:15):

I think this is a question about visible width - like how it will be printed in a terminal using a monospace font

Richard Feldman (Apr 28 2024 at 18:16):

this whole thread is reinforcing the decision to not include "string length" in the Str module - we've discussed like 5 different concepts of length, and all but one of them have turned out to be the wrong answer for the use case! :sweat_smile:

Ian McLerran (Apr 28 2024 at 18:36):

Clearly should have caught up on a little more context before dropping off topic responses! :woozy_face:
