Thanks to Josh Warner for this idea
Goal. Propose an approach that evolves Roc's parser to support syntax highlighting for HTML documentation.
Background. I have been researching a tree-sitter grammar for Roc, with documentation syntax highlighting as the use case. A tree-sitter grammar would also provide additional benefits, such as folding and support in most text editors. However, the more I researched, the more I started to question whether this was needed; can we just use the Roc parser instead of maintaining two?
Roc's indentation rules are subtle, so it may be difficult to build and maintain a tree-sitter parser that works correctly. Keywords are particularly challenging as they are context-sensitive with many subtle cases to handle.
Proposed Solution. Emit relevant syntax information as a side stream in the parser. This supports a syntax highlighting use case, and is easy to make optional, which eliminates any effect on the production compiler.
@Joshua Warner has recently embarked on a campaign to convert the parser to use combinators. His goal is to have one parser (to rule them all) which can be used in multiple contexts with different behaviour as required. The Parser trait currently has a single parse() method. When everything is combinator-based, it will be possible to have a second parse_for_highlight() method which returns alternate data. In this case, it returns a list of syntax-relevant locations in the buffer, i.e. Vec<Loc<SyntaxElement>>.
Once the parser is fully converted to combinators, we only need to implement this logic in the combinators. We can capture the Loc information for relevant syntax such as keywords, builtins, constants, literals, strings, comments, operators, and other characters (',', ':', etc.). Then when we serialise the text buffer, we can insert <span> and </span> wherever needed.
Note that there is a significant amount of up-front work before we see any externally visible progress. However, I think this is a worthwhile objective to work towards.
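A minimal sketch of that serialisation step, assuming the highlight pass has already produced sorted, non-overlapping locations. The `Loc` struct, the `SyntaxElement` variants and the CSS class names here are hypothetical stand-ins, not the actual roc_parse types:

```rust
/// Hypothetical syntax categories; the real SyntaxElement enum may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SyntaxElement {
    Keyword,
    Comment,
    StringLiteral,
}

/// Stand-in for Loc<SyntaxElement>: a half-open byte range plus its category.
#[derive(Clone, Copy, Debug)]
pub struct Loc {
    pub start: usize,
    pub end: usize,
    pub value: SyntaxElement,
}

/// Assumed CSS class names, chosen for illustration only.
fn css_class(el: SyntaxElement) -> &'static str {
    match el {
        SyntaxElement::Keyword => "kw",
        SyntaxElement::Comment => "comment",
        SyntaxElement::StringLiteral => "str",
    }
}

/// Append `text` to `out`, escaping the characters that are unsafe in HTML text.
fn push_escaped(out: &mut String, text: &str) {
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
}

/// Wrap each located region of `src` in a <span>.
/// Assumes `locs` is sorted by start, non-overlapping, and on char boundaries.
pub fn to_html(src: &str, locs: &[Loc]) -> String {
    let mut out = String::new();
    let mut pos = 0;
    for loc in locs {
        // Text between highlighted regions is copied through unstyled.
        push_escaped(&mut out, &src[pos..loc.start]);
        out.push_str("<span class=\"");
        out.push_str(css_class(loc.value));
        out.push_str("\">");
        push_escaped(&mut out, &src[loc.start..loc.end]);
        out.push_str("</span>");
        pos = loc.end;
    }
    push_escaped(&mut out, &src[pos..]);
    out
}
```

Because the spans come from locations rather than text matching, anything the parser did not tag (for example a keyword-looking word inside a comment region) is left alone.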
that sounds great! :smiley:
Also, regarding the use case of highlighting for GitHub: from my research it looks like _Syntax highlighting in GitHub is performed using TextMate-compatible grammars_, which are based on regex. When a language gets to 200 unique repositories in the wild, the team there will support it. Tree-sitter isn't supported in Linguist, so it's mostly a useful alternative for editing support in text editors before the Roc editor gets up and running. Just thought I would share here for anyone interested.
Thanks for the writeup @Luke Boswell !
Here's a quick proof-of-concept sketch for how I was thinking about implementing this idea: https://github.com/roc-lang/roc/pull/4608
There are some key requirements in tree-sitter that we need for the editor's free/legacy editing mode:
So my questions are:
Preserving spacing and comments would also be important for the kind of parser a formatter would need.
Incremental parsing is a very similar transform to what @Luke Boswell and I are suggesting for generating tokens for highlighting - we have a third method that all the combinators implement, which caches a map from the observable parts of the State to the parsed return value. For incremental parsing, we just need to invalidate the cached return values that overlap with the edited span and run parsing again. This turns parsing from O(n) in the input length to O(d) in the tree depth.
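A rough sketch of the cache-and-invalidate idea, under heavy simplification: the cache key here is just a start offset, whereas the message above describes keying on the observable parts of the State. `ParseCache` and its methods are hypothetical names, and the sketch does not shift the offsets of entries that sit after the edit:

```rust
use std::collections::HashMap;

/// Hypothetical cache: maps a start offset to (end offset, parsed value).
pub struct ParseCache<T> {
    entries: HashMap<usize, (usize, T)>,
}

impl<T: Clone> ParseCache<T> {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached result for `start`, or run `parse` and remember it.
    /// `parse` returns the end offset it reached and the parsed value.
    pub fn get_or_parse(
        &mut self,
        start: usize,
        parse: impl FnOnce(usize) -> (usize, T),
    ) -> (usize, T) {
        if let Some(hit) = self.entries.get(&start) {
            return hit.clone();
        }
        let result = parse(start);
        self.entries.insert(start, result.clone());
        result
    }

    /// Drop every entry whose half-open span [start, end) overlaps the edit.
    pub fn invalidate(&mut self, edit_start: usize, edit_end: usize) {
        self.entries
            .retain(|&start, (end, _)| *end <= edit_start || start >= edit_end);
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

After an edit, only the entries that overlapped the edited span are recomputed on the next parse; everything else is served from the cache.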
We'll also need to make sure the Regions from cached AST results still work. This will require a little finagling, but it's definitely possible.
Error recovery is similar - after an error we need the parser to be able to skip over some number of unrecognized characters and/or close some number of AST nodes until parsing can continue without error. The main thing this requires, on top of what we've talked about so far, is that the AST needs to be able to represent errors at arbitrary levels - something like formalizing the existing MalformedIdent / etc.
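As a toy illustration only (a comma-separated list of integers, not Roc syntax), here is what "represent errors in the tree and keep going" can look like. The `Expr` type and `parse_list` function are hypothetical; `Malformed` plays the role that MalformedIdent-style nodes play in the message above:

```rust
#[derive(Debug, PartialEq)]
pub enum Expr<'a> {
    Num(i64),
    /// Text that could not be parsed, kept in the tree so later passes
    /// (error reporting, highlighting) can still see it.
    Malformed(&'a str),
}

/// Parse comma-separated integers, recovering at each comma:
/// a bad item becomes a Malformed node instead of aborting the parse.
pub fn parse_list(src: &str) -> Vec<Expr<'_>> {
    src.split(',')
        .map(|item| {
            let item = item.trim();
            match item.parse::<i64>() {
                Ok(n) => Expr::Num(n),
                Err(_) => Expr::Malformed(item),
            }
        })
        .collect()
}
```

Here the comma acts as the synchronization point; a real grammar would need to pick its own recovery points and allow error nodes at every level of the tree.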
Will fulfilling these requirements degrade the experience for the parsing of files for compilation?
No; very few intermediate combinators need to be introduced to support this, so the overhead should be minimal on the compilation path. And even then, my hope is that compiler inlining can completely eliminate that overhead - but it certainly warrants measurement.
Preserve all whitespace
I'm not sure it's actually necessary to do this. If you have a concrete syntax tree that has nodes for each of the paren characters and the identifier in ( thingy ), with a Loc for each of them - it's easy to infer from the Locs that there's whitespace between the parens and the ident. You don't need an extra thing in the tree to tell you that.
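To make the ( thingy ) example concrete, here is a small sketch that recovers the text between adjacent tokens purely from their locations. `Region` is a hypothetical stand-in for a Loc's byte range:

```rust
/// Hypothetical stand-in for a Loc's region: a half-open byte range.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Region {
    pub start: usize,
    pub end: usize,
}

/// Return the source text lying between each pair of adjacent tokens.
/// Assumes `tokens` is sorted by position and non-overlapping.
pub fn gaps<'a>(src: &'a str, tokens: &[Region]) -> Vec<&'a str> {
    tokens
        .windows(2)
        .map(|pair| &src[pair[0].end..pair[1].start])
        .collect()
}
```

For "( thingy )" with regions for the open paren, the identifier and the close paren, each gap comes back as a single space without the tree storing any whitespace node.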
We already preserve (most) newlines/comments in the AST.
Alright, thanks for addressing my concerns @Joshua Warner :)
If we're planning on doing major work on the Parser it might be a good time to see if this new Parser AST can be used by the editor.
@Richard Feldman @Folkert de Vries I know we originally planned to use the canonical AST for the editor but if I remember correctly, that one does not contain comments, which we do need for the editor.
I think the editor's AST would need to consist of NodeIds, instead of having the actual values in the tree. This way a plugin can pass an update for a specific NodeId. With that requirement, we could use NodeIds in the parse AST or we'd need to convert from the NodeIds to the parse AST on every edit.
I'm not making any formal proposals here, just thinking through some options.
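One way to picture the NodeId option, as a sketch only: nodes live in a flat arena and refer to each other by index, so an edit can target a single id. The `NodeId`, `Node` and `Ast` types below are hypothetical and are not the editor's or the parser's actual data structures:

```rust
/// Hypothetical node id: an index into the arena.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

#[derive(Clone, Debug, PartialEq)]
pub enum Node {
    Ident(String),
    Comment(String),
    /// Children are referenced by id rather than stored inline.
    Apply { func: NodeId, args: Vec<NodeId> },
}

#[derive(Default)]
pub struct Ast {
    nodes: Vec<Node>,
}

impl Ast {
    /// Add a node to the arena and return its id.
    pub fn push(&mut self, node: Node) -> NodeId {
        self.nodes.push(node);
        NodeId(self.nodes.len() - 1)
    }

    pub fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }

    /// Replace one node in place; parents that hold this id need no change.
    pub fn update(&mut self, id: NodeId, node: Node) {
        self.nodes[id.0] = node;
    }
}
```

A plugin-style update then only needs the id and the new node; the parent Apply keeps pointing at the same slot.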
FWIW, at some point I want to start code-genning the AST structs/enums. Probably based on a description in Roc code, because why not? It should be simple to also codegen something that removes all comment/newline information for canonicalization. Similarly, there could be an "editor-friendly" variant that also gets codegenned at the same time, with conversion routines.
Ooh, that sounds interesting :)
yeah, we maybe should revisit parser data structures. I saw some talk about using tokens recently as well, which is also interesting
when we started, we rejected tokens because we'd only ever see compilers that have a separate tokenization step. It always looked like (because the examples are small) the tokens were materialized. So you would turn the input string into a list of tokens, then parse those tokens
but it turns out that is not how it should work: the array of tokens should never materialize in memory
so, maybe tokens are actually preferable, since there is no performance downside (if done well) and might simplify our parsing logic
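A small sketch of what "tokens that never materialize" can mean in Rust: the lexer is an iterator, and the consumer pulls one token at a time, so no Vec of tokens is ever built. The `Token` and `Lexer` types and the tiny token set are hypothetical and much simpler than anything Roc would need:

```rust
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    Ident(&'a str),
    Number(&'a str),
    Symbol(char),
}

/// A lexer that holds only the unconsumed remainder of the input.
pub struct Lexer<'a> {
    rest: &'a str,
}

impl<'a> Lexer<'a> {
    pub fn new(src: &'a str) -> Self {
        Self { rest: src }
    }

    /// Consume and return the longest prefix whose chars satisfy `pred`.
    fn take_while(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let end = self
            .rest
            .find(|c: char| !pred(c))
            .unwrap_or(self.rest.len());
        let (head, tail) = self.rest.split_at(end);
        self.rest = tail;
        head
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Token<'a>;

    /// Produce the next token on demand; returns None at end of input.
    fn next(&mut self) -> Option<Token<'a>> {
        self.rest = self.rest.trim_start();
        let first = self.rest.chars().next()?;
        if first.is_alphabetic() {
            Some(Token::Ident(self.take_while(|c| c.is_alphanumeric())))
        } else if first.is_ascii_digit() {
            Some(Token::Number(self.take_while(|c| c.is_ascii_digit())))
        } else {
            // Any other character becomes a single-char symbol token.
            self.rest = &self.rest[first.len_utf8()..];
            Some(Token::Symbol(first))
        }
    }
}
```

A parser written against `impl Iterator<Item = Token>` can consume this directly, and a caller that does want the full list (as in the zig approach mentioned later in the thread) can still `collect()` it.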
also, both for the editor and the compiler, parsing needs to be fast. The data structures we use are a big part of that
I am writing a blog post (for my company's company blog, if you have comments on it, let me know) https://hackmd.io/@Q66MPiW4T7yNTKOCaEb-Lw/ryfenBCO5
where I measure a 12% reduction in memory used in this relatively simple example. I think we could do more of these transforms and achieve more gains. I'd guess that a reduction in memory usage will translate into reduced processing costs during canonicalization
FWIW, zig does a full tokenization pass, preserves the tokens in memory, and then runs the parser over that - and the zig parser is _fast_.
Materializing the tokens in memory in theory opens up opportunities for massively-SIMD-ifying the parser, ala simdjson, except over tokens instead of characters.
Can I please have some feedback or comments on my PR#5018? Is this a worthwhile idea to progress with? Any assistance would be most appreciated.
This is a proposal for an interim solution to achieve Roc syntax highlighting in markdown code blocks for our static-site platform, specifically for the use case of the Roc website and tutorial. Longer term options are to have a tree-sitter grammar or to use the Roc compiler to achieve this in a much more reliable way. @Joshua Warner has made a lot of progress recently with the parser and investigated a number of options in this direction which show promising results.
I apologise for my terrible Rust here; I managed to cobble a quick proof of concept together from some examples online. The basic idea is to find and replace all the text for specific Roc keywords with the equivalent HTML, i.e. import becomes <span class="kw">import </span>. This is effectively what I have been doing by hand for the examples in our tutorial content.
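For readers who want a feel for this interim approach, here is a minimal sketch of keyword find-and-replace. It is not the code from the PR: the keyword list is an assumed subset, it matches whole words only, it puts the span around the keyword without the trailing space, and it does no HTML escaping:

```rust
/// Assumed keyword subset, for illustration only.
const KEYWORDS: [&str; 4] = ["import", "when", "is", "if"];

/// Wrap every whole-word keyword occurrence in a <span class="kw">.
/// This is purely textual: it knows nothing about comments or strings.
pub fn highlight_keywords(src: &str) -> String {
    let mut out = String::new();
    let mut word = String::new();
    for ch in src.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            // Still inside a word; keep accumulating.
            word.push(ch);
        } else {
            flush(&mut out, &mut word);
            out.push(ch);
        }
    }
    flush(&mut out, &mut word);
    out
}

/// Emit the pending word, wrapped in a span if it is a keyword.
fn flush(out: &mut String, word: &mut String) {
    if KEYWORDS.contains(&word.as_str()) {
        out.push_str("<span class=\"kw\">");
        out.push_str(word.as_str());
        out.push_str("</span>");
    } else {
        out.push_str(word.as_str());
    }
    word.clear();
}
```

The limitation is visible right away: a keyword inside a comment still gets highlighted, which is the kind of text-matching bug that a parser-driven highlighter avoids.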
This certainly seems like a great starting point - dramatically better than not having syntax highlighting (or doing it by hand)!
For a more complete approach, I do still want to eventually make the roc parser usable for highlighting. This enables cool things like positional keywords working correctly and eliminates the possibility of regex-based bugs. I have a few irons in the fire here, but nothing that's quite ready yet - and it doesn't seem worthwhile to keep blocking tutorial work on that (or, making it extra painful).
@Luke Boswell - a few things to watch out for:
Made some excellent progress on this, thanks to @Joshua Warner for adding a lexing-based 'highlight' mode to the parser. Here is PR#5051, which adds syntax highlighting using the roc_parse crate.
Also, I forgot to add the keywords in the example, but it doesn't highlight any keywords in comments etc., which is super nice.
Last updated: Jun 16 2026 at 16:19 UTC