Chumsky, a Rust parser-combinator library with error recovery

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • chumsky

    Write expressive, high-performance parsers with ease.

    Tree-sitter appears to be ultra-focused on producing valid syntax trees really fast. This is great for e.g. syntax highlighting, but suboptimal in cases where you are writing a reference parser for your custom language and want to provide very useful error descriptions. Chumsky seems to be more suited for the latter (and also has a part of tutorial about precedence[0], so it seems to deal at least with that case).

    This overview of parser tradeoffs may be helpful: https://blog.jez.io/tree-sitter-limitations/.

    [0] https://github.com/zesterer/chumsky/blob/master/tutorial.md#...

  • percival

    📝 Web-based, reactive Datalog notebooks for data analysis and visualization

    I haven't written a parser with Chumsky, bit I've played with a little one a bit if you wanna see an example syntax. The error reporting for this project is implemented with `ariadne` which is also really slick.

    Parser: https://github.com/ekzhang/percival/blob/main/crates/perciva...

    Error reporting: https://github.com/ekzhang/percival/blob/main/crates/perciva...

    Datalog playground: https://percival.ink/

    To see an error report, delete some punctuation from one of the Datalog code blocks then press shift-return.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

  • rust

    Empowering everyone to build reliable and efficient software.

    The creator of Chumsky is also quite a big proponent of getting generic associated types stabilized in Rust [0], interestingly enough. They have several comments talking about how GATs were very helpful for Chumsky to model their combinators.

    [0] https://github.com/rust-lang/rust/pull/96709

  • prql

    PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

    I can vouch for adriane, the error display library that is the sister project of Chomsky.

    We integrated it into PRQL compiler and the errors are beautiful!

    https://github.com/prql/prql/pull/275

  • nom

    Rust parser combinator framework

    Caveats: I've used nom in anger, chumsky hardly at all, and tree-sitter only for prototyping. I'm using it for parsing a DSL, essentially a small programming language.

    The essential difference between nom/chomsky and tree-sitter is that the former are libraries for constructing parsers out of smaller parsers, whereas tree-sitter takes a grammar specification and produces a parser. This may seem small at first, but is a massive difference in practice.

    As far as ergonomics go, that's a rather subjective question. On the surface, the parser combinator libraries seem easier to use. They integrate well with the the host language, so you can stay in the same environment. But this comes with a caveat: parser combinators are a functional programming pattern, and Rust is only kind of a functional language, if you treat it juuuuust right. This will make itself known when your program isn't quite right; I've seen type errors that take up an entire terminal window or more. It's also very difficult to decompose a parser into functions. In the best case, you need to write your functions to be generic over type constraints that are subtle and hard to write. (again, if you get this wrong, the errors are overwhelming) I often give up and just copy the code. I have at times believed that some of these types are impossible to write down in a program (and can only exist in the type inferencer), but I don't know if that's actually true.

    deep breath

    Tree-sitter's user interface is rather different. You write your grammar in a javascript internal dsl, which gets run and produces a json file, and then a code generator reads that and produces C source code (I think the codegen is now written in rust). This is a much more roundabout way of getting to a parser, but it's worth it because: (1) tree-sitter was designed for parsing programming languages while nom very clearly was not, and (2) the parsers it generates are REALLY GOOD. Tree-sitter knows operator precedence, where nom cannot do this natively (there's a PR open for the next version: https://github.com/Geal/nom/pull/1362) Tree-sitter's parsing algorithm (GLR) is tolerant to recursion patterns that will send a parser combinator library off into the weeds, unless it uses special transformations to accommodate them.

    It might sound like I'm shitting on nom here, but that's not the goal. It's a fantastic piece of work, and I've gotten a lot of value from it. But it's not for parsing programming languages. Reach for nom when you want to parse a binary file or protocol.

    As for chumsky: the fact that it's a parser combinator library in Rust means that it's going to be subject to a lot of the same issues as nom, fundamentally. That's why I'm targeting tree-sitter next.

    There's no reason tree-sitter grammars couldn't be written in an internal DSL, perhaps in parser-combinator style (https://github.com/engelberg/instaparse does this). That could smooth over a lot of the rough edges.

  • instaparse

    Caveats: I've used nom in anger, chumsky hardly at all, and tree-sitter only for prototyping. I'm using it for parsing a DSL, essentially a small programming language.

    The essential difference between nom/chomsky and tree-sitter is that the former are libraries for constructing parsers out of smaller parsers, whereas tree-sitter takes a grammar specification and produces a parser. This may seem small at first, but is a massive difference in practice.

    As far as ergonomics go, that's a rather subjective question. On the surface, the parser combinator libraries seem easier to use. They integrate well with the the host language, so you can stay in the same environment. But this comes with a caveat: parser combinators are a functional programming pattern, and Rust is only kind of a functional language, if you treat it juuuuust right. This will make itself known when your program isn't quite right; I've seen type errors that take up an entire terminal window or more. It's also very difficult to decompose a parser into functions. In the best case, you need to write your functions to be generic over type constraints that are subtle and hard to write. (again, if you get this wrong, the errors are overwhelming) I often give up and just copy the code. I have at times believed that some of these types are impossible to write down in a program (and can only exist in the type inferencer), but I don't know if that's actually true.

    deep breath

    Tree-sitter's user interface is rather different. You write your grammar in a javascript internal dsl, which gets run and produces a json file, and then a code generator reads that and produces C source code (I think the codegen is now written in rust). This is a much more roundabout way of getting to a parser, but it's worth it because: (1) tree-sitter was designed for parsing programming languages while nom very clearly was not, and (2) the parsers it generates are REALLY GOOD. Tree-sitter knows operator precedence, where nom cannot do this natively (there's a PR open for the next version: https://github.com/Geal/nom/pull/1362) Tree-sitter's parsing algorithm (GLR) is tolerant to recursion patterns that will send a parser combinator library off into the weeds, unless it uses special transformations to accommodate them.

    It might sound like I'm shitting on nom here, but that's not the goal. It's a fantastic piece of work, and I've gotten a lot of value from it. But it's not for parsing programming languages. Reach for nom when you want to parse a binary file or protocol.

    As for chumsky: the fact that it's a parser combinator library in Rust means that it's going to be subject to a lot of the same issues as nom, fundamentally. That's why I'm targeting tree-sitter next.

    There's no reason tree-sitter grammars couldn't be written in an internal DSL, perhaps in parser-combinator style (https://github.com/engelberg/instaparse does this). That could smooth over a lot of the rough edges.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts