This is an extreme beginner's question, but I can't figure out how to get the length of a string in Roc. Can someone help, please?
Hi @Abhinav Sarkar,
If you only expect your Str to contain ASCII characters, you can use List.len (Str.toUtf8 myStr). For an in-depth treatment of this surprisingly complicated problem I recommend reading the sections Unicode, Graphemes, Code Points... near the top of the Str docs.
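As a rough illustration of why the byte count only matches the character count for ASCII, here is a minimal sketch in Python (used purely for illustration, it is not Roc). The helper name utf8_len is made up for this example; it measures the same thing that List.len (Str.toUtf8 myStr) would.

```python
def utf8_len(s: str) -> int:
    """Number of UTF-8 bytes, analogous to List.len (Str.toUtf8 myStr)."""
    return len(s.encode("utf-8"))


ascii_text = "hello"
accented = "h\u00e9llo"  # "héllo" with a precomposed e-acute (one code point, two bytes)

# For pure ASCII the byte count and the code point count agree.
print(utf8_len(ascii_text), len(ascii_text))

# For non-ASCII text the byte count is larger than the code point count.
print(utf8_len(accented), len(accented))
```

So the byte count is only a safe stand-in for "length" when every character is ASCII.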
I'd like to handle Unicode too. I read the docs. It recommends using the roc-unicode package, but it does not have any releases.
And I don't know how to use a package from Github without releases. Can't find any doc on that.
Do we have something like countGraphemes in roc-unicode that works @Luke Boswell?
Can you tell us a bit more about why you need the length of a string @Abhinav Sarkar? That may allow me to give some specific suggestions.
I'm writing a pretty printing algorithm and it requires me to know how many columns a string takes when printed. If I understand correctly, that's the grapheme count of the string.
Yeah, Grapheme.split from roc-lang/unicode can be used to do this with unicode.
We don't have a release so you currently have to clone that repo (it's tiny) and use a local reference
You can use a path for a package, like we do here
Thanks!
Or if you are looking for unicode CodePoints https://github.com/roc-lang/unicode/blob/main/examples/simple.roc
^^ this example shows how to parse utf8 and also the package from a path
I think for a pretty printing "column" count, grapheme count makes the most sense
if you're in a monospace font, 1 grapheme will more often correspond to 1 "character width" than anything else will (including code points), although it's still not guaranteed
for example, some East Asian characters are defined to be half-width
I'd like to be as accurate as possible. Any suggestion for dealing with such cases?
I actually just tried this out (I asked chatGPT for some concrete examples of them) and they seem to all render the same width in my terminal:
so maybe that's not an issue in practice
so grapheme count should work fine?
if you wanted to be maximally accurate, I think you'd need to get specific fonts involved and measure their glyph widths, which probably doesn't make sense for a pretty printer :big_smile:
personally I would use grapheme count for this, yeah
I can't think of a measurement that's font-agnostic and would work better than grapheme count
I'd rather not deal with fonts. I'll assume monospace. Let me figure out how to get the grapheme count.
Also, some feedback as a noob: it seems really wrong to have to jump through all these hoops to get a string's length.
I could be misunderstanding but I think the "issue" here isn't a Roc-specific one. It might appear Roc is asking you to go through extra hoops, but that's probably because other languages in some cases gloss over things a bit too much, and make assumptions on behalf of the user.
Whether this is an indication that there could exist a tiny bit more user-friendly library in user space - probably, yes, but it'll come with caveats, as there aren't any shortcuts which could be taken without compromise in this case.
Sure, but in my life I've never seen a single language that has a String data type and does not have a length function in it. TBH, I was very surprised with Roc's choice. Just my 2c.
yeah this is an intentional choice
the way it goes in most languages is:
Str.len or similar
my hope is to replace that status quo with:
Str.len or similar, discover it's not there
this flow will be smoother when we have a release of roc-unicode, to be fair :sweat_smile:
I've never seen a language like it either. I was surprised too. But I'm finding there are lots of little details like this where a lot of care has gone into thinking about how to set people up for success. It's why I love using roc so much.
I'd love to improve the unicode library. If anyone is interested in helping with that then please let me know. A second pair of eyes to help make things more sensible would be amazing.
@Abhinav Sarkar, my understanding is that one of the philosophies behind the design of Roc is to minimise (or eliminate altogether) the number of ways a developer could shoot themselves in the foot.
One of the directions for ensuring this is via not allowing use-case ambiguities and misinterpretation of semantics to be mistaken to be shortcomings of the language.
The language enables the user to truly attempt to identify and understand their use-case, and act accordingly.
If there is insufficient clarity or understanding in a use-case, then the language cannot fill it in on behalf of the user, because at best that would be a guess. That's why the user is empowered to actually get to understand better the underlying problem they're trying to solve.
Again, user-space libraries do not have to share the same methodologies as the standard library, and many users might find functionality providing glossing-over of concepts (which in other languages may come as part of the standard library) available at that kind of level.
Richard Feldman said:
this flow will be smoother when we have a release of roc-unicode, to be fair :sweat_smile:
When is the release expected?
Nothing stopping us from making a release today. Though it doesn't cover much of unicode yet, and hasn't been tested in great detail, so maybe it's not a good idea? I don't know, but I guess a release would indicate a level of maturity, and what is there is still a WIP, though usable.
It took me quite a while to build the text segmentation for graphemes. It would probably be much easier now.
I was planning to do some fuzz testing with it, but keep getting distracted on other projects and work.
Can make it release 0.0.0-alpha or something
The visual width in a terminal is implemented using the wcwidth function, the python package has an explanation of it here: https://pypi.org/project/wcwidth/
I actually started working on a roc-wcwidth package a while ago for a project, but didn't finish it. I'll dig it up tomorrow and see if I can finish it.
Hannes said:
The visual width in a terminal is implemented using the wcwidth function, the python package has an explanation of it here: https://pypi.org/project/wcwidth/
Will it not be the same as the grapheme count discussed above?
Characters of category East Asian Wide (W) or East Asian Full-width (F) which are displayed using two terminal cells.
Those count as 2 cells
So it will be different
that's a good point, this is probably a better choice than graphemes!
you could always start with graphemes to get close and then switch to using this later, if you don't want to block on it
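To make the difference from grapheme count concrete, here is a minimal sketch in Python (for illustration only, not Roc) of a wcwidth-style column count. It relies on the standard library's unicodedata module; the function name display_width and the simplifications (combining marks count as 0, everything that is not Wide or Fullwidth counts as 1, control characters are not special-cased) are assumptions of this sketch, not the behaviour of the wcwidth package itself.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate terminal columns: 2 for Wide/Fullwidth, 0 for combining marks, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # East Asian Wide or Fullwidth: two terminal cells
        else:
            width += 1  # everything else is treated as one cell
    return width


for sample in ["abc", "\u4f60\u597d", "e\u0301"]:
    print(repr(sample), display_width(sample))
```

A pretty printer could start with something this simple and refine the special cases later.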
Would it be something worth adding from the platform? https://github.com/unicode-rs/unicode-width
Given basic CLI will likely have tools built for a terminal, maybe we should expose that as a helper until we can write a pure roc implementation?
Actually, we might have most of the things we need in our unicode package already
yeah it's probably not very different from graphemes I'd imagine
From what I can tell, there is just the EastAsianWidth data file which maps codepoints to the width property. And then everything else is (neutral) and given a width of 1, except a handful of hardcoded cases like em dash.
Actually, it's just this data file
# The format is two fields separated by a semicolon.
# Field 0: Unicode code point value or range of code point values
# Field 1: East_Asian_Width property, consisting of one of the following values:
# "A", "F", "H", "N", "Na", "W"
# - All code points, assigned or unassigned, that are not listed
# explicitly are given the value "N".
The Unicode Character Database [UCD] assigns to each Unicode character as its default width property one of six values: Ambiguous, Fullwidth, Halfwidth, Narrow, Wide, or Neutral (= Not East Asian). For any given operation, these six default property values resolve into only two property values, narrow and wide, depending on context.
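The two-field format described in that header is simple to parse. Below is a minimal sketch in Python (for illustration only, not the Roc generator discussed in this thread). The SAMPLE lines are written by hand in the file's format rather than copied from the real data file, and the name parse_eaw is hypothetical.

```python
from typing import List, Tuple

# Illustrative lines in the data file's format; not copied verbatim from the real file.
SAMPLE = """\
# comment lines start with '#'
0041..005A;Na  # Latin capital letters
3000;F         # ideographic space
4E00..9FFF;W   # CJK unified ideographs
"""


def parse_eaw(text: str) -> List[Tuple[int, int, str]]:
    """Parse 'code point or range ; property' lines into (start, end, property) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        field0, field1 = (part.strip() for part in line.split(";", 1))
        if ".." in field0:
            first, last = field0.split("..")  # Field 0 is a range
        else:
            first = last = field0             # Field 0 is a single code point
        ranges.append((int(first, 16), int(last, 16), field1))
    return ranges


print(parse_eaw(SAMPLE))
```

Code points that never appear in the parsed ranges would default to "N", as the header states.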
We already have a few examples that do this in our package, so this would be quite easy to implement.
If someone would like to have a crack at this, we just need to add that data file to unicode/package/data, then write an InternalEAWGen.roc file that is almost a copy-paste of InternalGBPGen.roc to parse the data file and generate a Roc file that maps CodePoints CP to an East Asian Width property EAW : [Ambiguous, Fullwidth, Halfwidth, Narrow, Neutral, Wide], and then implement a corresponding helper that uses this to walk through a List U8 or a Str and sum the widths.
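The last step, walking a string and summing widths from a generated table, could look roughly like the following Python sketch (illustration only, not the Roc package's code). WIDE_RANGES is a hypothetical, heavily abbreviated table; a generated one would list every Wide and Fullwidth range from the data file. The names char_width and str_width are also made up for this example.

```python
from bisect import bisect_right

# Hypothetical, abbreviated table of (start, end, width) ranges, sorted by start.
WIDE_RANGES = [
    (0x1100, 0x115F, 2),  # Hangul Jamo initial consonants
    (0x3000, 0x303E, 2),  # CJK symbols and punctuation
    (0x4E00, 0x9FFF, 2),  # CJK unified ideographs
    (0xFF01, 0xFF60, 2),  # fullwidth forms
]
STARTS = [start for start, _, _ in WIDE_RANGES]


def char_width(code_point: int) -> int:
    """Look up one code point; anything not in the table is treated as width 1."""
    i = bisect_right(STARTS, code_point) - 1  # last range starting at or before it
    if i >= 0:
        start, end, width = WIDE_RANGES[i]
        if start <= code_point <= end:
            return width
    return 1


def str_width(text: str) -> int:
    """Walk the string and sum the per-code-point widths."""
    return sum(char_width(ord(ch)) for ch in text)


print(str_width("abc"), str_width("\u4f60\u597d"))
```

A binary search over sorted ranges keeps the lookup cheap even when the full table has hundreds of entries.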
This is some of what InternalGBP.roc looks like.
interface InternalGBP
exposes [GBP, fromCP, isExtend, isZWJ]
imports [InternalCP.{ CP, toU32, fromU32Unchecked }]
GBP : [CR, Control, Extend, ZWJ, RI, Prepend, SpacingMark, V, T, LF, LVT, LV, L, Other]
isCR : U32 -> Bool
isCR = \u32 -> (u32 == 13)
# etc
Aren't emojis double width?
I believe emoji are all quadruple length.
For example: "🙂" |> Str.toUtf8 == [240, 159, 153, 130]
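The byte values above can be checked with a short Python sketch (illustration only, not Roc). It also suggests that four bytes is not universal: an emoji in the Basic Multilingual Plane takes three bytes, and a sequence joined with zero-width joiners takes many more. The helper name utf8_byte_count is made up for this example.

```python
def utf8_byte_count(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


smiley = "\U0001F642"     # 🙂 SLIGHTLY SMILING FACE
older_smiley = "\u263A"   # ☺ WHITE SMILING FACE, inside the Basic Multilingual Plane
# Man + ZWJ + woman + ZWJ + girl: five code points that render as one family emoji.
zwj_family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(list(smiley.encode("utf-8")))
print(utf8_byte_count(older_smiley), utf8_byte_count(zwj_family))
```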
I think this is a question about visible width - like how it will be printed in a terminal using a monospace font
this whole thread is reinforcing the decision to not include "string length" in the Str module - we've discussed like 5 different concepts of length, and all but one of them have turned out to be the wrong answer for the use case! :sweat_smile:
Clearly should have caught up on a little more context before dropping off topic responses! :woozy_face:
Like just the comment before timotree! :sweat_smile:
no worries, the rabbit hole is very deep on this topic! :laughing:
https://github.com/roc-lang/unicode/issues/6 created to track the above
Last updated: Jul 06 2025 at 12:14 UTC