Stream: beginners

Topic: Questions about API design


view this post on Zulip Paul Stanley (Nov 24 2024 at 10:59):

I've been playing with the unicode package, to add normalization. I've made decent progress on the guts, but I now have some questions about API design, which I really know nothing about. I guess before I go much further I should contact the maintainers of that package, but it seemed worth asking some general questions here.

  1. Internally, I work exclusively with lists of codepoints. Clearly, the library should expose functions that handle String. But should it also expose functions that deal with codepoint lists? I can see that users might well prefer not to repeatedly pay the overhead of conversion from/to Utf8. But the existing part of the package only exposes a single function split, which is only defined for String, and I suppose there is an argument for keeping things minimal. My hunch, though, is that it would be best to expose both kinds of function.

  2. Related question. Things can go wrong when converting a utf-8 string to a codepoint list, though it's not going to be common and it's always going to be pathological. So far the library only exposes functions which return a Result, reflecting that risk.

But to me at least, "Result poisoning" is a thing (though I think this may run against the grain of Roc in that respect). There are times when a sane thing to do is to assume that input will be valid and crash if it isn't. After all, Num makes that assumption. It could wrap every mathematical operation to return a Result, but it accepts that it will sometimes be better to trust the program to know what it's doing, rather than insist on unwrapping and checking every operation which might in theory fail, even at the risk of runtime disaster.

I would therefore favour exposing an "unsafe" or "unchecked" version of very basic operations, which crashes on failure, as well as a Result version. Is that wrong?

If that is to be done, should the "safe" version be the default (so we have, e.g., toNfc : Str -> Result Str [Utf8ParseError] and toNfcUnchecked : Str -> Str)? Or should the "unsafe" version be the default (so we have toNfc : Str -> Str and toNfcChecked : Str -> Result Str [Utf8ParseError])? If the default form is checked, what is the most idiomatic way of naming the "non-Result" version: Unchecked? Unsafe?

I lean towards making the unchecked version the default (because I think it's the one people would normally use, for good reason) and naming the checked version to___Checked, which seems consistent with basic functions for numbers. But if so, the same approach should really be taken throughout the whole package.

How much to expose? In particular:

Is there a good reason to expose functions which check whether a string is in a particular normal form? It seems obvious, but I'm not sure I can really think of a time when a normal user would find that interesting, and the increased efficiency of checking directly rather than testing whether a string is in normal form with if str == toNfc str ... is minimal (since the normalization process does that check efficiently and simply returns the string if it's already normalized). I guess I'm asking: is it generally better for a library to expose a minimal set of functions, or should one say "well, I have the tool, so I can easily make it available"? Is the Roc idiom a "batteries included"/"kitchen sink" approach--definitely a thing in some libraries I've used in the past--or does it prefer minimalism: only basic functions that users will regularly need?

In unicode, strings can be normalized into four different forms (NFC, NFD, NFKC, and NFKD--the details don't matter for present purposes). API question: is this better?

normalize : Str, [NFC, NFD, NFKC, NFKD] -> Str

or this?

toNFC : Str -> Str
toNFD : Str -> Str
# ...etc

or both forms? (A user will usually want to use a single normalization form everywhere, so will probably want to define a function that normalizes "as expected" by the particular program.)
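
For example (using an illustrative name, and assuming the tag-taking signature sketched above), a program might define its own wrapper once and use it everywhere:

normalizeDefault : Str -> Str
normalizeDefault = \str -> normalize str NFC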

Finally, one particular issue with normalization is that it is not stable when you concatenate. So if str1 is in normal form C and str2 is in normal form C, it does not follow that str1 + str2 will be. Again, there are two options: the user could concatenate the strings with Str.concat, and then normalize, or I could offer a normalized concatenation function. In theory at least I could make that more efficient, but I really wonder whether (in principle) one should be aiming to expose minor variants on functions which really "belong" elsewhere simply for the sake of (speculative) efficiency gains.
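
To make the instability concrete (a sketch, assuming the tag-taking normalize from above): "Cafe" and a string beginning with a combining acute accent (U+0301) are each already in NFC, but their concatenation is not, because NFC composes the trailing "e" with the accent into "é".

expect
    str1 = "Cafe"
    str2 = "\u(0301) society" # begins with a combining acute accent
    naive = Str.concat str1 str2
    # naive is not in NFC: normalizing it composes "e" + U+0301 into "é"
    normalize naive NFC != naive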

view this post on Zulip Luke Boswell (Nov 24 2024 at 11:17):

This is fantastic. Super cool to hear you thinking about this and sharing here.

I can try and answer some of your questions tomorrow with my 2 cents, though I suspect Richard will have thoughts on these. He has given me a lot of guidance with API design, and has thoughts around the roc unicode experience.

Looking forward to seeing this project develop :grinning:

view this post on Zulip Paul Stanley (Nov 24 2024 at 12:03):

Well it's developed at least to the point that I appear to be passing the NFC and NFD tests in the Unicode test file (though working with a test file that has nearly 21,000 test cases is its own little bit of fun). I haven't got round to testing the NFKC and NFKD forms yet, but I'm cautiously optimistic, because they add no new algorithms. What I haven't really attempted yet is any sort of testing of speed or optimization, and that may well end up being the roadblock, so there's a lot of water still to flow under the bridge. I'm also using rather naive data structures (as, to be fair, the Grapheme stuff does). So ...

What I obviously don't want to do is go miles down a particular API/basic design route and then find that this doesn't fit into what is envisaged for the rest of the library. I'm quite enthusiastic in theory about tackling collation, but that really is a mountain to climb.

view this post on Zulip Richard Feldman (Nov 24 2024 at 14:00):

these are excellent questions! :smiley:

Paul Stanley said:

  1. Internally, I work exclusively with lists of codepoints. Clearly, the library should expose functions that handle String. But should it also expose functions that deal with codepoint lists? I can see that users might well prefer not to repeatedly pay the overhead of conversion from/to Utf8. But the existing part of the package only exposes a single function split, which is only defined for String, and I suppose there is an argument for keeping things minimal. My hunch, though, is that it would be best to expose both kinds of function.

I think this is a tough design question because:

In other words, arguably exposing it is good for advanced users, but also arguably the advanced users in question are basically just people working on this library, and also arguably exposing it is a footgun.

I'd say for now let's just default to the Str one. We can always expose more later if there's demand in practice for it outside this library! :big_smile:

view this post on Zulip Richard Feldman (Nov 24 2024 at 14:37):

Paul Stanley said:

Things can go wrong when converting a utf-8 string to a codepoint list, though it's not going to be common and it's always going to be pathological. So far the library only exposes functions which return a Result, reflecting that risk.

But to me at least, "Result poisoning" is a thing (though I think this may run against the grain of Roc in that respect). There are times when a sane thing to do is to assume that input will be valid and crash if it isn't. After all, Num makes that assumption. It could wrap every mathematical operation to return a Result, but it accepts that it will sometimes be better to trust the program to know what it's doing, rather than insist on unwrapping and checking every operation which might in theory fail, even at the risk of runtime disaster.

I would therefore favour exposing an "unsafe" or "unchecked" version of very basic operations, which crashes on failure, as well as a Result version. Is that wrong?

this is a pretty deep topic, but I think the tutorial section on crashing is useful here:

crash is not for error handling.

The reason Roc has a crash keyword is for scenarios where it's expected that no error will ever happen (like in unreachable branches), or where graceful error handling is infeasible (like running out of memory).

Errors that are recoverable should be represented using normal Roc types (like Result) and then handled without crashing—for example, by having the application report that something went wrong, and then continue running from there.

view this post on Zulip Richard Feldman (Nov 24 2024 at 14:43):

in this particular case, I would say:

So in this package, I would have zero functions with the names "__Checked" and "__Unchecked" - either it should always return a Result, or it's dealing with Str to UTF-8 and should crash in branches that ought to be unreachable (unless there are bugs in the Roc compiler or in the platform, which packages should assume aren't happening).

view this post on Zulip Richard Feldman (Nov 24 2024 at 14:46):

Paul Stanley said:

one particular issue with normalization is that it is not stable when you concatentate. So if str1 is in normal form C and str2 is in normal form C, it does not follow that str1 + str2 will be. Again, there are two options: the user could concatenate the strings with Str.concat, and then normalize, or I could offer a normalized concatenation function. In theory at least I could make that more efficient, but I really wonder whether (in principle) one should be aiming to expose minor variants on functions which really "belong" elsewhere simply for the sake of (speculative) efficiency gains.

I could see an argument for making a NormalStr type (and module) which exposes fromStr and toStr and then also has its own (more efficient) concatenation operations.

this would also have the benefit of giving users a nice way to write functions that only accept a normalized str, etc. - which can prevent other bugs!
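
A rough sketch of what that might look like using an opaque type (the module layout, the Normalization import, and the choice of NFC are all just illustrative here):

module [NormalStr, fromStr, toStr, concat]

import Normalization # hypothetical home of the tag-taking normalize discussed above

## Opaque wrapper: the only way to construct one is fromStr, so a NormalStr is
## normalized (NFC here, but it could be any form) by construction.
NormalStr := Str

fromStr : Str -> NormalStr
fromStr = \str -> @NormalStr (Normalization.normalize str NFC)

toStr : NormalStr -> Str
toStr = \@NormalStr str -> str

## Keeps the invariant. This naive version just re-normalizes the whole result;
## a more efficient one could fix things up only around the join point.
concat : NormalStr, NormalStr -> NormalStr
concat = \@NormalStr a, @NormalStr b ->
    @NormalStr (Normalization.normalize (Str.concat a b) NFC)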

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:17):

Paul Stanley said:

Is there a good reason to expose functions which check whether a string is in a particular normal form? It seems obvious, but I'm not sure I can really think of a time when a normal user would find that interesting, and the increased efficiency of checking directly rather than testing whether a string is in normal form with if str == toNfc str ... is minimal (since the normalization process does that check efficiently and simply returns the string if it's already normalized).

I guess I'm asking: is it generally better for a library to expose a minimal set of functions, or should one say "well, I have the tool, so I can easily make it available"? Is the Roc idiom a "batteries included"/"kitchen sink" approach--definitely a thing in some libraries I've used in the past--or does it prefer minimalism: only basic functions that users will regularly need?

I don't think of good API design as being about "expose a lot" versus "expose a little" - I think it's more important to balance these things:

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:20):

so to me, an argument for not exposing those functions would be something like "this isn't something I think people will need, and it might be a footgun in that people might incorrectly reach for it when they'd get better performance if they reached for NormalStr instead"

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:24):

one technique I've seen in Elm for balancing considerations like this is to have a module named ___LowLevel (e.g. CodePointLowLevel) - which can make it clear that "this isn't something you should reach for normally; rather, it's something you reach for if the regular API doesn't expose any possible way to do the thing you want to do, and you're willing to resort to a (usually less ergonomic but more flexible) lower-level API because you just really need it to be done in that way and there's no other alternative"

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:24):

that might be a reasonable place to put a function like detecting normalization

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:27):

anyway, thank you for the excellent questions! I really appreciate that you're approaching this in such a careful and thoughtful way. :smiley:

view this post on Zulip Paul Stanley (Nov 24 2024 at 15:27):

That's all incredibly helpful. The implementation of grapheme currently has this note:

# TODO DISCUSS
# I'm not sure if we should return an error here or just crash.
# A Roc Str should be valid utf8 and so in theory it should not be possible
# for split to have invalid utf8 in it. To be discussed.

It sounds like the sensible option (in the library as a whole) therefore would just be to crash if for any reason it gets invalid utf-8 in a string. Since normalization (assuming we've got so far) is not something that is going to error, there's no particular reason to return a Result. But it would definitely make sense to be consistent in this across the entire library.
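
Concretely, the shape would be something like this (a sketch: CodePoint.parseUtf8 here stands in for however the existing CodePoint module exposes that conversion, and the function name is made up):

codePointsOrCrash : Str -> List CodePoint
codePointsOrCrash = \str ->
    when CodePoint.parseUtf8 (Str.toUtf8 str) is
        Ok codePoints -> codePoints
        # A Roc Str is guaranteed to hold valid UTF-8, so this branch should be
        # unreachable; hitting it would mean a compiler, platform, or package bug.
        Err _ -> crash "unicode: invalid UTF-8 in a Str (this should be impossible)"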

view this post on Zulip Richard Feldman (Nov 24 2024 at 15:31):

yeah, makes sense!

view this post on Zulip Paul Stanley (Nov 24 2024 at 15:36):

I wouldn't think it necessary to expose any public functions handling lists of U8s. That can and should be left to String. But internally the Unicode package works with a custom CodePoint type, which wraps an internal custom type which wraps a U32. There's some overhead in converting from utf-8 to codepoints and back, and I can see that anyone doing a substantial amount of unicode manipulation might prefer to work with codepoints. I'll think carefully about what I expose. For the moment, I have to expose some of the list functions I think because it's the only way for me to produce the necessary automated test without a lot of roundtripping from lists of codepoints to U8 and back again, which seems undesirable. But I could hide that behind an internal module.

Anyway. Thanks for the very thoughtful and useful guidance.

view this post on Zulip Paul Stanley (Nov 24 2024 at 16:10):

And ... yes ... it seems to be doing what it should do (though I haven't worked out how to run 21,000 test cases in any sort of completely reliable way ... so fingers still crossed that passing 7,000 of them is a hint of where we're headed).

Just occasionally it's nice to see an expectation fail (this may only make sense if you're into unicode normalization ...):

expect
    str = "Café society"
    res = normalize NFD str
    dbg Str.toUtf8 str
    dbg Str.toUtf8 res
    str == res

Running tests…

[Normalization.roc:426] Str.toUtf8 str = [67, 97, 102, 195, 169, 32, 115, 111, 99, 105, 101, 116, 121]

[Normalization.roc:428] Str.toUtf8 res = [67, 97, 102, 101, 204, 129, 32, 115, 111, 99, 105, 101, 116, 121]
── EXPECT FAILED in Normalization.roc ─────

This expectation failed:
423│>  expect
424│>      str = "Café society"
425│>      res = normalize NFD str

When it failed, these variables had these values:
str : Str
str = "Café society"
res : Str
res = "Café society"

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 16:23):

One other note, NFC, NFD, etc are not meaningful names

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 16:24):

They shouldn't be part of the API

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 16:24):

We should make those terms longer in a way that is meaningful to people who know little about Unicode

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 16:25):

General question, how often would a user want to pick a specific normalization vs just use a default normalization? Like will 99% of the time people just want to use NFC for example?

view this post on Zulip Richard Feldman (Nov 24 2024 at 16:31):

Paul Stanley said:

For the moment, I have to expose some of the list functions I think because it's the only way for me to produce the necessary automated test without a lot of roundtripping from lists of codepoints to U8 and back again, which seems undesirable. But I could hide that behind an internal module.

yeah I think an internal module sounds like the way to go!

In general, I have a very hardline stance that the public-facing API should never, ever be changed for the sake of internal automated tests.

Internal tests are a tool for helping to achieve the goal of package quality, but the API is a huge part of the package's quality itself. So sacrificing the package's quality in order to make it easier to use a tool that can potentially improve the package's quality is not the right tradeoff to me - I'd much rather find another way to test the package! :big_smile:

view this post on Zulip Paul Stanley (Nov 24 2024 at 18:17):

Brendan Hansknecht said:

We should make those terms longer in a way that is meaningful to people who know little about Unicode

They are about as meaningful as they can be! These are the unicode terms, deliberately chosen because they are not exactly meaningful -- and the "meaningful" versions (canonical composition, canonical decomposition, compatibility decomposition, and compatibility decomposition with canonical composition) are not any better. It would just be very odd to use anything other than the Unicode Consortium approved acronyms, I think.

So I think the right thing to do here is to use the jargon. I've written a fairly long piece of documentation for the module which explains them in more detail and includes recommendations about which to use. (Bottom line: probably NFC for most purposes ...)

view this post on Zulip Paul Stanley (Nov 24 2024 at 22:26):

Thanks for all the advice. I split it into two modules (one "internal" and intended for internal package use such as testing only), and decided provisionally to expose both a "work-or-crash" version and a "check-and-result" version, consistently with the way Num does it, so at least anyone who is massively crash-phobic can avoid even the tiniest risk. Otherwise, it exposes just two functions: normalize and normalizeChecked, each of which takes a tag to determine which normalization form it produces.
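
So the public surface is roughly this (the error tag is still provisional):

normalize : Str, [NFC, NFD, NFKC, NFKD] -> Str
normalizeChecked : Str, [NFC, NFD, NFKC, NFKD] -> Result Str [Utf8ParseError]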

I wrote a long-ish explanation of normalization at the top of the main module file, but roc docs doesn't seem to do anything with it: not sure what I should be doing about that.

Anyway, still very much WIP, but it's sitting on a fork at https://github.com/PaulStanley/unicode. One thing I'm currently having trouble with is working out how to test 21,000-odd test cases without exploding ... but it manages the first 7,000 without error, so fingers crossed. I'm pretty sure the real problem will be with speed and efficiency.

view this post on Zulip Richard Feldman (Nov 24 2024 at 22:39):

yooooo this is awesome, amazing work!!! :star_struck:

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 23:47):

Paul Stanley said:

They are about as meaningful as they can be! These are the unicode terms, deliberately chosen because they are not exactly meaningful -- and the "meaningful" versions (canonical composition, canonical decomposition, compatibility decomposition, and compatibility decomposition with canonical composition) are not any better.

Sad, but makes sense

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 23:49):

Paul Stanley said:

consistently with the way Num does it, so at least anyone who is massively crash-phobic can avoid even the tiniest risk. Otherwise, it exposes just two functions: normalize and normalizeChecked, each of which takes a tag to determine which normalization form it produces.

Please don't. Num primitives for math are a class of special exceptions. This is not the API to model off of. In packages, we have debated completely banning crash. Crash is only meant for unreachable states within packages and not for error handling. Packages should all be crash-phobic.

view this post on Zulip Brendan Hansknecht (Nov 24 2024 at 23:50):

If a user wants to crash, they can make a wrapper in their application for doing so. The default should be results and error handling
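
For example, an application that would genuinely rather die than handle the error could wrap it like this (names are illustrative, assuming the Result-returning variant from the package's Normalization module):

normalizeOrCrash : Str -> Str
normalizeOrCrash = \str ->
    when Normalization.normalizeChecked str NFC is
        Ok normalized -> normalized
        Err _ -> crash "unexpected invalid UTF-8 while normalizing"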

view this post on Zulip Luke Boswell (Nov 24 2024 at 23:55):

There is currently a crash in the unicode package... I left it there because I was trying to find all the edge cases for grapheme text segmentation. I think we should look at replacing that with a Result instead. Hannes helped with some fuzzing, but I haven't had the time to go back and clean all that up and fix it.

view this post on Zulip Richard Feldman (Nov 25 2024 at 01:58):

Paul Stanley said:

I wrote a long-ish explanation of normalization at the top of the main module file, but roc docs doesn't seem to do anything with it: not sure what I should be doing about that.

this is a bug in docs generation currently

view this post on Zulip Richard Feldman (Nov 25 2024 at 03:51):

Paul Stanley said:

decided provisionally to expose both a "work-or-crash" version and a "check-and-result" version, consistently with the way Num does it, so at least anyone who is massively crash-phobic can avoid even the tiniest risk. Otherwise, it exposes just two functions: normalize and normalizeChecked, each of which takes a tag to determine which normalization form it produces.

if I understand this right, the only way this can fail is if a Str contains invalid UTF-8, which I think packages should assume never happens.

In other words, this is a case (branch that should be unreachable) where I think crash is the right design, and I think the best design would be to expose normalize but not normalizeChecked :big_smile:

view this post on Zulip Brendan Hansknecht (Nov 25 2024 at 04:46):

Ah, if this is for an unreachable case, then yeah, crash and no checked variant

view this post on Zulip Paul Stanley (Nov 25 2024 at 07:32):

Yes. I think that's the case, though I need to think a bit more about whether the unicode CodePoint module could ever fail to parse valid utf-8 to valid unicode codepoints (and that's existing code). Anyway, the main thing is I certainly understand the design philosophy on this sort of thing ... crash _only_ for unreachable cases. At which point I suppose the checked variant is unnecessary. I will proceed accordingly.

view this post on Zulip Paul Stanley (Nov 25 2024 at 07:33):

I'm quite certain that normalization as such can't turn valid unicode into invalid unicode. The issue, if there is one, is entirely at the boundary between String and CodePoint.

view this post on Zulip Richard Feldman (Nov 25 2024 at 12:41):

yeah valid UTF-8 should always be parseable into code points, so if that ever fails it's because of a bug in our implementation!

