Stream: beginners

Topic: How to get String length?


view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 10:05):

This is an extremely beginners' question, but I can't figure out how to get the length of a string in Roc. Can someone help please?

view this post on Zulip Anton (Apr 23 2024 at 10:14):

Hi @Abhinav Sarkar,
If you only expect your Str to contain ASCII characters, you can use List.len (Str.toUtf8 myStr). For an in-depth treatment of this surprisingly complicated problem I recommend reading the sections Unicode, Graphemes, Code Points... near the top of the Str docs.
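
The UTF-8 byte count suggested above matches the character count only for ASCII text. A minimal sketch of the same idea, written in Python rather than Roc purely as an illustration (`utf8_len` is a hypothetical helper name, not part of any library mentioned here):

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


print(utf8_len("hello"))        # 5: one byte per ASCII character
print(utf8_len("h\u00e9llo"))   # 6: U+00E9 (e with acute) takes two bytes
```

As soon as the input contains anything outside ASCII, the byte count and the number of visible characters drift apart.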

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 10:15):

I'd like to handle Unicode too. I read the docs. It recommends using the roc-unicode package, but it does not have any releases.

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 10:16):

And I don't know how to use a package from GitHub without releases. I can't find any docs on that.

view this post on Zulip Anton (Apr 23 2024 at 10:42):

Do we have something like countGraphemes in roc-unicode that works @Luke Boswell?
Can you tell us a bit more about why you need the length of a string @Abhinav Sarkar? That may allow me to give some specific suggestions.

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 10:44):

I'm writing a pretty-printing algorithm and it requires me to know how many columns a string takes when printed. If I understand correctly, that's the grapheme count of the string.

view this post on Zulip Luke Boswell (Apr 23 2024 at 10:48):

Yeah, Grapheme.split from roc-lang/unicode can be used to do this with unicode.

view this post on Zulip Luke Boswell (Apr 23 2024 at 10:49):

We don't have a release so you currently have to clone that repo (it's tiny) and use a local reference

view this post on Zulip Anton (Apr 23 2024 at 10:50):

You can use a path for a package, like we do here

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 10:51):

Thanks!

view this post on Zulip Luke Boswell (Apr 23 2024 at 10:51):

Or if you are looking for unicode CodePoints https://github.com/roc-lang/unicode/blob/main/examples/simple.roc

view this post on Zulip Luke Boswell (Apr 23 2024 at 10:52):

^^ this example shows how to parse UTF-8 and also how to use the package from a path

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:07):

I think for a pretty printing "column" count, grapheme count makes the most sense

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:09):

if you're in a monospace font, 1 grapheme will more often correspond to 1 "character width" than anything else will (including code points), although it's still not guaranteed
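
To make the code point versus grapheme distinction concrete, here is a rough Python sketch. It is not the roc-unicode implementation, and `approx_grapheme_count` is a hypothetical helper: it only skips combining marks, whereas real grapheme segmentation (Unicode UAX #29) also covers ZWJ emoji sequences, Hangul syllables, flags and more.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Count code points that are not combining marks.

    Assumption: this is a deliberate simplification, not UAX #29.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT, shown as one character

print(len(s))                    # 2 code points
print(approx_grapheme_count(s))  # 1
```

The string occupies one column on screen but has two code points, which is why the grapheme count tends to track "character width" more closely.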

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:10):

for example, some East Asian characters are defined to be half-width

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 11:10):

I'd like to be as accurate as possible. Any suggestion for dealing with such cases?

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:12):

I actually just tried this out (I asked chatGPT for some concrete examples of them) and they seem to all render the same width in my terminal:

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:12):

so maybe that's not an issue in practice

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 11:12):

so grapheme count should work fine?

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:13):

if you wanted to be maximally accurate, I think you'd need to get specific fonts involved and measure their glyph widths, which probably doesn't make sense for a pretty printer :big_smile:

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:13):

personally I would use grapheme count for this, yeah

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:13):

I can't think of a measurement that's font-agnostic and would work better than grapheme count

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 11:16):

I'd rather not deal with fonts. I'll assume monospace. Let me figure out how to get the grapheme count.

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 11:16):

Also, some feedback as a noob: it seems really wrong to have to jump through all these hoops to get a string's length.

view this post on Zulip Hristo (Apr 23 2024 at 11:40):

I could be misunderstanding but I think the "issue" here isn't a Roc-specific one. It might appear Roc is asking you to go through extra hoops, but that's probably because other languages in some cases gloss over things a bit too much, and make assumptions on behalf of the user.

Whether this is an indication that there could exist a tiny bit more user-friendly library in user space - probably, yes, but it'll come with caveats, as there aren't any shortcuts which could be taken without compromise in this case.

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 11:43):

Sure, but in my life I've never seen a single language that has a String data type and does not have a length function in it. TBH, I was very surprised with Roc's choice. Just my 2c.

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:48):

yeah this is an intentional choice

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:52):

the way it goes in most languages is:

  1. Use a builtin Str.len or similar
  2. Emoji bugs get reported
  3. Learn about the different ways string length can be represented (graphemes and such), figure out which one makes the best sense for this use case, implement a fix

my hope is to replace that status quo with:

  1. Look for builtin Str.len or similar, discover it's not there
  2. Ask in #beginners how to get string length in Roc
  3. Learn about the different ways string length can be represented (graphemes and such), figure out which one makes the best sense for this use case, implement it using that way the first time around instead of shipping with Unicode bugs :smiley:
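
The "emoji bugs" in the first flow usually come from the fact that one visible character can have several different lengths. A small Python illustration (not Roc; the variable names are made up for this sketch) using a family emoji built from three people joined by zero-width joiners:

```python
# man + ZWJ + woman + ZWJ + girl: typically displayed as a single emoji
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # Unicode code points
utf8_bytes = len(family.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(family.encode("utf-16-le")) // 2  # 16-bit units in UTF-16

print(code_points, utf8_bytes, utf16_units)  # 5 18 8
```

None of these three numbers is the 1 that a user would expect, so a single built-in length function has to pick one of them and be wrong for some use cases.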

view this post on Zulip Richard Feldman (Apr 23 2024 at 11:52):

this flow will be smoother when we have a release of roc-unicode, to be fair :sweat_smile:

view this post on Zulip Luke Boswell (Apr 23 2024 at 11:53):

I've never seen a language like it either. I was surprised too. But I'm finding there are lots of little details like this where a lot of care has gone into thinking about how to set people up for success. It's why I love using roc so much.

I'd love to improve the unicode library. If anyone is interested in helping with that then please let me know. A second pair of eyes to help make things more sensible would be amazing.

view this post on Zulip Hristo (Apr 23 2024 at 11:54):

@Abhinav Sarkar, my understanding is that one of the philosophies behind the design of Roc is to minimise (or eliminate altogether) the number of ways a developer could shoot themselves in the foot.
One of the directions for ensuring this is via not allowing use-case ambiguities and misinterpretation of semantics to be mistaken to be shortcomings of the language.
The language enables the user to truly attempt to identify and understand their use-case, and act accordingly.

view this post on Zulip Hristo (Apr 23 2024 at 11:57):

If there is insufficient clarity or understanding in a use-case, then this signifies that the language cannot fill it in on behalf of the user, because at best that would be a guess. That's why the user is empowered to actually get to understand better the underlying problem they're trying to solve.

Again, user-space libraries do not have to share the same methodologies as the standard library, and many users might find functionality providing glossing-over of concepts (which in other languages may come as part of the standard library) available at that kind of level.

view this post on Zulip Abhinav Sarkar (Apr 23 2024 at 13:23):

Richard Feldman said:

this flow will be smoother when we have a release of roc-unicode, to be fair :sweat_smile:

When is the release expected?

view this post on Zulip Luke Boswell (Apr 23 2024 at 21:32):

Nothing is stopping us from making a release today. Though it doesn't cover much of Unicode yet, and hasn't been tested in great detail, so maybe it's not a good idea? I don't know, but I guess a release would indicate a level of maturity, and what is there is still a WIP, though usable.

It took me quite a while to build the text segmentation for graphemes. It would probably be much easier now.

I was planning to do some fuzz testing with it, but keep getting distracted on other projects and work.

view this post on Zulip Brendan Hansknecht (Apr 24 2024 at 00:27):

Can make it release 0.0.0-alpha or something

view this post on Zulip Hannes (Apr 24 2024 at 14:27):

The visual width in a terminal is implemented using the wcwidth function, the python package has an explanation of it here: https://pypi.org/project/wcwidth/

I actually started working on a roc-wcwidth package a while ago for a project, but didn't finish it. I'll dig it up tomorrow and see if I can finish it.

view this post on Zulip Abhinav Sarkar (Apr 24 2024 at 15:15):

Hannes said:

The visual width in a terminal is implemented using the wcwidth function, the python package has an explanation of it here: https://pypi.org/project/wcwidth/

Will it not be the same as the grapheme count discussed above?

view this post on Zulip Brendan Hansknecht (Apr 24 2024 at 22:32):

Characters of category East Asian Wide (W) or East Asian Full-width (F) which are displayed using two terminal cells.

Those count as 2 cells

view this post on Zulip Brendan Hansknecht (Apr 24 2024 at 22:32):

So it will be different
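
A rough Python sketch of why the two measures differ, using the standard library's `unicodedata.east_asian_width`. This is not the wcwidth algorithm itself: `display_width` is a hypothetical helper that ignores zero-width combining marks and control characters, and only applies the "W or F takes two cells" rule quoted above.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal cells: 2 for Wide/Fullwidth code points, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


print(display_width("abc"))                 # 3
print(display_width("\u65e5\u672c\u8a9e"))  # 6: three CJK ideographs, two cells each
```

The second string has three graphemes but would occupy six terminal cells under this rule.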

view this post on Zulip Richard Feldman (Apr 24 2024 at 22:43):

that's a good point, this is probably a better choice than graphemes!

view this post on Zulip Richard Feldman (Apr 24 2024 at 22:44):

you could always start with graphemes to get close and then switch to using this later, if you don't want to block on it

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:46):

Would it be something worth adding from the platform? https://github.com/unicode-rs/unicode-width

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:47):

Given basic CLI will likely have tools built for a terminal, maybe we should expose that as a helper until we can write a pure roc implementation?

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:50):

Actually, we might have most of the things we need in our unicode package already

view this post on Zulip Richard Feldman (Apr 24 2024 at 22:52):

yeah it's probably not very different from graphemes I'd imagine

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:54):

From what I can tell, there is just the EastAsianWidth data file, which maps code points to the width property. Everything else is neutral and given a width of 1, except a handful of hardcoded cases like em dash.

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:55):

Actually, it's just this data file

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:55):

# The format is two fields separated by a semicolon.
# Field 0: Unicode code point value or range of code point values
# Field 1: East_Asian_Width property, consisting of one of the following values:
#         "A", "F", "H", "N", "Na", "W"
#  - All code points, assigned or unassigned, that are not listed
#      explicitly are given the value "N".
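
A line in the two-field format described above could be parsed along these lines. This is a Python sketch only, not the generator in the unicode package; `parse_eaw_line` is a hypothetical name, and the sample lines are illustrative inputs written in the described format.

```python
from typing import Optional, Tuple


def parse_eaw_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Parse one line into (first, last, property); None for blank or comment lines."""
    content = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not content:
        return None
    code_field, prop_field = (part.strip() for part in content.split(";"))
    if ".." in code_field:
        first, last = code_field.split("..")  # a range of code points
    else:
        first = last = code_field             # a single code point
    return int(first, 16), int(last, 16), prop_field


print(parse_eaw_line("1100..115F;W"))  # (4352, 4447, 'W')
print(parse_eaw_line("3000;F"))        # (12288, 12288, 'F')
print(parse_eaw_line("# a comment"))   # None
```

Code points that never appear in the parsed output would then default to "N", as the header comment states.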

view this post on Zulip Luke Boswell (Apr 24 2024 at 22:57):

The Unicode Character Database [UCD] assigns to each Unicode character as its default width property one of six values: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (= Not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.

view this post on Zulip Luke Boswell (Apr 24 2024 at 23:02):

Screenshot-2024-04-25-at-09.02.19.png

view this post on Zulip Luke Boswell (Apr 24 2024 at 23:12):

We already have a few examples that do this in our package, so this would be quite easy to implement.

If someone would like to have a crack at this, we just need to add that data file to unicode/package/data, then write an InternalEAWGen.roc file that is almost a copy-paste of InternalGBPGen.roc to parse the data file and generate a Roc file that maps CodePoints CP to an East Asian Width property EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide], and then implement a corresponding helper that uses this to walk through a List U8 or a Str and sum the widths.

This is some of what the InternalGBP.roc looks like.

interface InternalGBP
    exposes [GBP, fromCP, isExtend, isZWJ]
    imports [InternalCP.{ CP, toU32, fromU32Unchecked }]

GBP : [CR, Control, Extend, ZWJ, RI, Prepend, SpacingMark, V, T, LF, LVT, LV, L, Other]

isCR : U32 -> Bool
isCR = \u32 -> (u32 == 13)

# etc

view this post on Zulip timotree (Apr 28 2024 at 17:26):

Aren't emojis double width?
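
As far as the East_Asian_Width property goes, this can be checked with Python's `unicodedata` module (a quick illustration, not Roc; whether a particular terminal and font actually render the emoji two cells wide is a separate matter):

```python
import unicodedata

# U+1F642 SLIGHTLY SMILING FACE
print(unicodedata.east_asian_width("\U0001F642"))  # W
print(unicodedata.east_asian_width("a"))           # Na
```

A width function based on this property would therefore count that emoji as two cells.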

view this post on Zulip Ian McLerran (Apr 28 2024 at 18:09):

I believe emoji are all quadruple length.

For example: "🙂" |> Str.toUtf8 == [240, 159, 153, 130]

view this post on Zulip Richard Feldman (Apr 28 2024 at 18:15):

I think this is a question about visible width - like how it will be printed in a terminal using a monospace font

view this post on Zulip Richard Feldman (Apr 28 2024 at 18:16):

this whole thread is reinforcing the decision to not include "string length" in the Str module - we've discussed like 5 different concepts of length, and all but one of them have turned out to be the wrong answer for the use case! :sweat_smile:

view this post on Zulip Ian McLerran (Apr 28 2024 at 18:36):

Clearly should have caught up on a little more context before dropping off topic responses! :woozy_face:

view this post on Zulip Ian McLerran (Apr 28 2024 at 18:37):

Like just the comment before timotree! :sweat_smile:

view this post on Zulip Richard Feldman (Apr 28 2024 at 18:38):

no worries, the rabbit hole is very deep on this topic! :laughing:

view this post on Zulip Luke Boswell (Apr 29 2024 at 10:18):

https://github.com/roc-lang/unicode/issues/6 created to track the above


Last updated: Jul 06 2025 at 12:14 UTC