Syntax Highlighting using the Roc Parser · ideas

Stream: ideas

Topic: Syntax Highlighting using the Roc Parser

Luke Boswell (Nov 25 2022 at 19:49):

Thanks to Josh Warner for this idea

Goal. Propose an approach that evolves Roc's parser to support syntax highlighting for HTML documentation.

Background. I have been researching how to develop a tree-sitter grammar for Roc, for the use case of documentation syntax highlighting. A tree-sitter grammar would also provide additional benefits, such as folding and support in most text editors. However, the more I researched, the more I questioned whether this was needed: can we just use the Roc parser instead of maintaining two parsers?

Roc's indentation rules are subtle, so it may be difficult to build and maintain a tree-sitter parser that works correctly. Keywords are particularly challenging as they are context-sensitive with many subtle cases to handle.

Proposed Solution. Emit relevant syntax information as a side stream in the parser. This supports the syntax highlighting use case and is easy to make optional, which eliminates any effect on the production compiler.

@Joshua Warner has recently embarked on a campaign to convert the parser to use combinators. His goal is to have one parser (to rule them all) which can be used in multiple contexts with different behaviour as required. The Parser trait currently has a single parse() method. Once everything is combinator-based, it will be possible to add a second parse_for_highlight() method which returns alternate data. In this case, it would return a list of syntax-relevant locations in the buffer, i.e. Vec<Loc<SyntaxElement>>.

Once the parser is fully converted to combinators, we only need to implement this logic in the combinators. We can capture the Loc information for relevant syntax such as keywords, builtins, constants, literals, strings, comments, operators, and other characters like ( , : etc. Then when we serialise the text buffer, we can insert <span> and </span> wherever needed.
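
A minimal sketch of that serialisation step, in Rust. The `Loc` and `SyntaxElement` types below are simplified stand-ins rather than the actual roc_parse definitions, and `render_html`, `css_class`, `escape_html`, and the CSS class names are hypothetical names chosen for illustration. It assumes the locations arrive sorted by start offset and do not overlap.

```rust
use std::ops::Range;

/// Hypothetical syntax categories; the real list would come from the parser.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxElement {
    Keyword,
    Literal,
    Str,
    Comment,
    Operator,
}

/// Simplified stand-in for the parser's `Loc<T>`: a byte range plus a value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Loc<T> {
    pub region: Range<usize>,
    pub value: T,
}

/// Map a syntax category to an (assumed) CSS class name.
fn css_class(el: SyntaxElement) -> &'static str {
    match el {
        SyntaxElement::Keyword => "kw",
        SyntaxElement::Literal => "lit",
        SyntaxElement::Str => "str",
        SyntaxElement::Comment => "comment",
        SyntaxElement::Operator => "op",
    }
}

/// Append `text` to `out`, escaping the characters that are special in HTML text.
fn escape_html(text: &str, out: &mut String) {
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
}

/// Serialise `src`, wrapping each located element in a `<span>`.
/// Assumes `locs` is sorted by start offset, non-overlapping,
/// and that every region lies on UTF-8 character boundaries.
pub fn render_html(src: &str, locs: &[Loc<SyntaxElement>]) -> String {
    let mut out = String::new();
    let mut cursor = 0;
    for loc in locs {
        // Text between the previous element and this one is copied unstyled.
        escape_html(&src[cursor..loc.region.start], &mut out);
        out.push_str("<span class=\"");
        out.push_str(css_class(loc.value));
        out.push_str("\">");
        escape_html(&src[loc.region.clone()], &mut out);
        out.push_str("</span>");
        cursor = loc.region.end;
    }
    // Copy whatever follows the last element.
    escape_html(&src[cursor..], &mut out);
    out
}
```

The renderer never looks at the source text to decide what is a keyword or a comment; it only trusts the locations it is given, which is why the side stream from the parser is the interesting part.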

Note that there is a significant amount of up-front work before we see any externally visible progress; however, I think this is a worthwhile objective to work towards.

Richard Feldman (Nov 25 2022 at 20:02):

that sounds great! :smiley:

Luke Boswell (Nov 26 2022 at 02:38):

Also, regarding the use case of highlighting on GitHub: from my research, it looks like _Syntax highlighting in GitHub is performed using TextMate-compatible grammars_, which are based on regex. When a language gets to 200 unique repositories in the wild, the team there will support it. Tree-sitter isn't supported in Linguist, so it's mostly a useful alternative for editing support in text editors before the Roc editor gets up and running. Just thought I would share here for anyone interested.

Joshua Warner (Nov 26 2022 at 04:44):

Thanks for the writeup @Luke Boswell !

Here's a quick proof-of-concept sketch for how I was thinking about implementing this idea: https://github.com/roc-lang/roc/pull/4608

Anton (Nov 26 2022 at 08:56):

There are some key requirements in tree-sitter that we need for the editor's free/legacy editing mode:

Incremental parsing: start by parsing a file completely, subsequent edits are parsed intelligently so that the whole file does not need to be re-parsed.
Error recovery: failures in parsing are contained so that the following lines can still be highlighted correctly.
Fast enough to parse on every keystroke in the editor.
Preserve all whitespace and newlines.
Preserve comments.

So my questions are:

Can the proposed solution fulfill these requirements?
Will fulfilling these requirements degrade the experience for the parsing of files for compilation? For example by making it harder to return good error messages or making maintenance noticeably more difficult.

Kevin Gillette (Nov 26 2022 at 14:55):

Preserving spacing and comments would also be important to the parser a formatter would perhaps need

Joshua Warner (Nov 26 2022 at 17:25):

Incremental parsing is a very similar transform to what @Luke Boswell and I are suggesting for generating tokens for highlighting - we have a third method that all the combinators implement, which caches a map from the observable parts of the State to the parsed return value. For incremental parsing, we just need to invalidate the cached return values that overlap with the edited span and run parsing again. This turns parsing from O(n) in the input length to O(d) in the tree depth.

We'll also need to make sure the Regions from cached AST results still work. This will require a little finagling, but it's definitely possible.
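
To make the invalidation and Region fix-up concrete, here is a rough Rust sketch. It is not the design from the PR: the real cache would be keyed on the observable parts of the parser State, whereas this hypothetical `ParseCache` is keyed only on the start offset. Entries that touch the edited span are dropped (conservatively including entries that merely abut it), and entries after the edit have their regions shifted by the change in length.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical cache: start offset -> (end offset, cached parse result).
pub struct ParseCache<T> {
    entries: BTreeMap<usize, (usize, T)>,
}

impl<T> ParseCache<T> {
    pub fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    /// Remember that parsing at `region.start` produced `value` and consumed `region`.
    pub fn insert(&mut self, region: Range<usize>, value: T) {
        self.entries.insert(region.start, (region.end, value));
    }

    /// Look up a cached result starting at `start`.
    pub fn get(&self, start: usize) -> Option<(Range<usize>, &T)> {
        self.entries.get(&start).map(|(end, v)| (start..*end, v))
    }

    /// Apply an edit that replaced the bytes in `edited` (old offsets) with `new_len` bytes.
    /// Entries strictly before the edit are kept, entries strictly after it are shifted,
    /// and everything that overlaps or abuts the edit is invalidated.
    pub fn apply_edit(&mut self, edited: Range<usize>, new_len: usize) {
        let old = std::mem::take(&mut self.entries);
        for (start, (end, value)) in old {
            if end < edited.start {
                // Entirely before the edit: offsets are unchanged.
                self.entries.insert(start, (end, value));
            } else if start > edited.end {
                // Entirely after the edit: shift by (new_len - old_len) without underflow.
                let shift = |offset: usize| offset - edited.end + edited.start + new_len;
                self.entries.insert(shift(start), (shift(end), value));
            }
            // Otherwise the entry touches the edit and is dropped, forcing a re-parse.
        }
    }
}
```

After `apply_edit`, re-running the parser only has to do real work for the spans whose entries were dropped; everything else is a cache hit with already-corrected regions.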

Joshua Warner (Nov 26 2022 at 17:29):

Error recovery is similar: after an error, the parser needs to be able to skip over some number of unrecognized characters and/or close some number of AST nodes until parsing can continue without error. The main thing that requires, on top of what we've talked about up until now, is that the AST needs to be able to represent errors at arbitrary levels - something like formalizing the existing MalformedIdent / etc.
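
As a toy illustration of that idea (not Roc's grammar), the Rust sketch below parses a comma-separated list of integers. A malformed item becomes an explicit error node carrying its region, and parsing resumes at the next comma. The `Item` enum, its `Malformed` variant, and `parse_items` are hypothetical names; `Malformed` is only loosely analogous to MalformedIdent.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
pub enum Item {
    Int(i64),
    /// Hypothetical error node: records where the unparseable text was.
    Malformed(Range<usize>),
}

/// Parse a comma-separated list of integers, recovering at each comma.
/// A bad item never aborts the parse; it is recorded and skipped.
pub fn parse_items(src: &str) -> Vec<Item> {
    let mut items = Vec::new();
    let mut start = 0;
    for piece in src.split(',') {
        let end = start + piece.len();
        match piece.trim().parse::<i64>() {
            Ok(n) => items.push(Item::Int(n)),
            // Contain the failure to this item's region and carry on.
            Err(_) => items.push(Item::Malformed(start..end)),
        }
        start = end + 1; // step over the comma
    }
    items
}
```

Because the error is a node in the result rather than an early return, a highlighter can still colour everything after the bad item.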

Joshua Warner (Nov 26 2022 at 17:31):

Will fulfilling these requirements degrade the experience for the parsing of files for compilation?

No; very few intermediate combinators need to be introduced to support this, so the overhead should be minimal on the compilation path. And even then, my hope is that compiler inlining can completely eliminate that overhead - but it certainly warrants measurement.

Joshua Warner (Nov 26 2022 at 17:34):

Preserve all whitespace

I'm not sure it's actually necessary to do this. If you have a concrete syntax tree that has nodes for each of the paren characters and the identifier in ( thingy ), with a Loc for each of them - it's easy to infer from the Locs that there's whitespace between the parens and the ident. You don't need an extra thing in the tree to tell you that.
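
A small Rust sketch of that inference, using plain byte ranges in place of Loc. The function name `gaps` is hypothetical, and it assumes the token regions are sorted and non-overlapping.

```rust
use std::ops::Range;

/// Return the regions of a source of length `src_len` that no token covers.
/// Assumes `tokens` is sorted by start offset and non-overlapping.
pub fn gaps(src_len: usize, tokens: &[Range<usize>]) -> Vec<Range<usize>> {
    let mut out = Vec::new();
    let mut cursor = 0;
    for t in tokens {
        if t.start > cursor {
            // Nothing in the tree claims these bytes, so they are whitespace
            // (or whatever else the tree chose not to represent).
            out.push(cursor..t.start);
        }
        cursor = t.end;
    }
    if cursor < src_len {
        out.push(cursor..src_len);
    }
    out
}
```

For `( thingy )`, the three token regions are 0..1, 2..8 and 9..10, and the uncovered regions 1..2 and 8..9 are exactly the two spaces.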

Joshua Warner (Nov 26 2022 at 17:34):

We already preserve (most) newlines/comments in the AST.

Anton (Nov 26 2022 at 18:49):

Alright, thanks for addressing my concerns @Joshua Warner :)

Anton (Nov 27 2022 at 13:00):

If we're planning on doing major work on the Parser it might be a good time to see if this new Parser AST can be used by the editor.
@Richard Feldman @Folkert de Vries I know we originally planned to use the canonical AST for the editor but if I remember correctly, that one does not contain comments, which we do need for the editor.

I think the editor's AST would need to consist of NodeIds, instead of having the actual values in the tree. This way a plugin can pass an update for a specific NodeId. With that requirement, we could use NodeIds in the parse AST or we'd need to convert from the NodeIds to the parse AST on every edit.

I'm not making any formal proposals here, just thinking through some options.

Joshua Warner (Nov 27 2022 at 16:04):

FWIW, at some point I want to start code-genning the AST structs/enums. Probably based on a description in Roc code, because why not? It should be simple to also codegen something that removes all comment/newline information for canonicalization. Similarly, there could be an "editor-friendly" variant that also gets codegenned at the same time, with conversion routines.

Anton (Nov 27 2022 at 16:25):

Ooh, that sounds interesting :)

Folkert de Vries (Nov 27 2022 at 16:33):

yeah, we maybe should revisit parser data structures. I saw some talk about using tokens recently as well, which is also interesting

Folkert de Vries (Nov 27 2022 at 16:34):

when we started, we rejected tokens because we'd only ever seen compilers that have a separate tokenization step. It always looked like (because the examples are small) the tokens were materialized. So you would turn the input string into a list of tokens, then parse those tokens

Folkert de Vries (Nov 27 2022 at 16:34):

but it turns out that is not how it should work: the array of tokens should never materialize in memory

Folkert de Vries (Nov 27 2022 at 16:35):

so, maybe tokens are actually preferable, since there is no performance downside (if done well) and might simplify our parsing logic
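
One way to get tokens without materializing them is to make the lexer an iterator that the parser pulls from on demand. The Rust sketch below is a toy, not Roc's lexer: `Token`, `Lexer`, and `count_idents` are hypothetical names, and the token kinds are deliberately minimal.

```rust
/// Toy token kinds that borrow their text from the source.
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    Ident(&'a str),
    Number(&'a str),
    Symbol(char),
}

/// A lexer that produces one token per `next()` call and stores no token list.
pub struct Lexer<'a> {
    src: &'a str,
    pos: usize,
}

impl<'a> Lexer<'a> {
    pub fn new(src: &'a str) -> Self {
        Self { src, pos: 0 }
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        // Skip leading whitespace, then look at the first character.
        let rest = &self.src[self.pos..];
        let trimmed = rest.trim_start();
        self.pos += rest.len() - trimmed.len();
        let first = trimmed.chars().next()?;

        let len = if first.is_alphabetic() {
            trimmed.find(|c: char| !c.is_alphanumeric()).unwrap_or(trimmed.len())
        } else if first.is_ascii_digit() {
            trimmed.find(|c: char| !c.is_ascii_digit()).unwrap_or(trimmed.len())
        } else {
            first.len_utf8()
        };
        let text = &trimmed[..len];
        self.pos += len;

        Some(if first.is_alphabetic() {
            Token::Ident(text)
        } else if first.is_ascii_digit() {
            Token::Number(text)
        } else {
            Token::Symbol(first)
        })
    }
}

/// Stand-in for a parser: consumes tokens one at a time as they are produced.
pub fn count_idents<'a>(tokens: impl Iterator<Item = Token<'a>>) -> usize {
    tokens.filter(|t| matches!(t, Token::Ident(_))).count()
}
```

Calling `count_idents(Lexer::new(src))` walks the input once and only ever holds a single token at a time; collecting the lexer into a `Vec` would give the materialized variant for comparison.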

Folkert de Vries (Nov 27 2022 at 16:35):

also, both for the editor and the compiler, parsing needs to be fast. The data structures we use are a big part of that

Folkert de Vries (Nov 27 2022 at 16:36):

I am writing a blog post (for my company's blog; if you have comments on it, let me know) https://hackmd.io/@Q66MPiW4T7yNTKOCaEb-Lw/ryfenBCO5

Folkert de Vries (Nov 27 2022 at 16:38):

where I measure a 12% reduction in memory used in this relatively simple example. I think we could do more of these transforms and achieve more gains. I'd guess that a reduction in memory usage will translate into reduced processing costs during canonicalization

Joshua Warner (Nov 27 2022 at 17:03):

FWIW, zig does a full tokenization pass, preserves the tokens in memory, and then runs the parser over that - and the zig parser is _fast_.

Joshua Warner (Nov 27 2022 at 17:04):

Materializing the tokens in memory in theory opens up opportunities for massively-SIMD-ifying the parser, à la simdjson, except over tokens instead of characters.

Luke Boswell (Feb 15 2023 at 07:10):

Can I please have some feedback or comments on my PR#5018? Is this a worthwhile idea to progress with? Any assistance would be most appreciated.

This is a proposal for an interim solution to achieve Roc syntax highlighting in markdown code blocks for our static-site platform, specifically for the use case of the Roc website and tutorial. Longer term options are to have a tree-sitter grammar or to use the Roc compiler to achieve this in a much more reliable way. @Joshua Warner has made a lot of progress recently with the parser and investigated a number of options in this direction which show promising results.

I apologise for my terrible Rust here; I managed to cobble a quick proof of concept together from some examples online. The basic idea is to find and replace all the text for specific Roc keywords with the equivalent HTML code, i.e. import becomes <span class="kw">import </span>. This is effectively what I have been doing by hand for the examples in our tutorial content.

Joshua Warner (Feb 15 2023 at 16:04):

This certainly seems like a great starting point - dramatically better than not having syntax highlighting (or doing it by hand)!

For a more complete approach, I do still want to eventually make the roc parser usable for highlighting. This enables cool things like positional keywords working correctly and eliminates the possibility of regex-based bugs. I have a few irons in the fire here, but nothing that's quite ready yet - and it doesn't seem worthwhile to keep blocking tutorial work on that (or, making it extra painful).

@Luke Boswell - a few things to watch out for:

You probably want to add a regex for comments and a regex for strings. Getting both a # in a string and a string in a comment to work correctly will be complicated - so to a first approximation I wouldn't worry about that.
You (probably?) don't want to highlight keywords if they occur in comments or strings
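
A rough Rust sketch of those two points. It swaps the suggested regexes for a small hand-written scanner so that it needs no external crates, and it is not the code from the PR. The keyword list is illustrative rather than complete, `highlight_keywords` and `string_len` are hypothetical names, strings are assumed to be double-quoted with backslash escapes, and HTML escaping is left out for brevity.

```rust
/// Illustrative keyword list; not the complete set of Roc keywords.
const KEYWORDS: &[&str] = &["if", "then", "else", "when", "is", "import"];

/// Length in bytes of the string literal at the start of `rest`,
/// including both quotes. An unterminated string runs to the end of input.
fn string_len(rest: &str) -> usize {
    let mut escaped = false;
    for (i, c) in rest.char_indices().skip(1) {
        if escaped {
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '"' {
            return i + 1;
        }
    }
    rest.len()
}

/// Wrap keywords in `<span class="kw">`, leaving `#` comments and
/// double-quoted strings untouched.
pub fn highlight_keywords(src: &str) -> String {
    let mut out = String::new();
    let mut pos = 0;
    while let Some(c) = src[pos..].chars().next() {
        let rest = &src[pos..];
        // Decide how many bytes the next chunk spans.
        let len = if c == '#' {
            // Comment: runs to the end of the line.
            rest.find('\n').unwrap_or(rest.len())
        } else if c == '"' {
            string_len(rest)
        } else if c.is_alphanumeric() || c == '_' {
            // Word: taken whole, so `gift` never matches `if`.
            rest.find(|d: char| !(d.is_alphanumeric() || d == '_'))
                .unwrap_or(rest.len())
        } else {
            c.len_utf8()
        };
        let text = &rest[..len];
        // Comments and strings start with `#` or `"`, so they never match a keyword.
        if KEYWORDS.contains(&text) {
            out.push_str("<span class=\"kw\">");
            out.push_str(text);
            out.push_str("</span>");
        } else {
            out.push_str(text);
        }
        pos += len;
    }
    out
}
```

Scanning left to right and consuming a whole comment or string as a single chunk handles a # inside a string and a quote inside a comment, since whichever starts first wins. Positional keywords are still out of reach for this approach, which is the case for eventually using the real parser.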

Luke Boswell (Feb 19 2023 at 07:46):

Made some excellent progress on this, thanks to @Joshua Warner adding a lexing-based 'highlight' mode to the parser. Here is PR#5051, which adds syntax highlighting using the roc_parse crate.

Luke Boswell (Feb 19 2023 at 07:49):

Also, I forgot to add the keywords in the example, but it doesn't highlight any keywords in comments etc., which is super nice.

Last updated: Jul 23 2026 at 13:15 UTC