HTML Parser

Open-source HTML projects categorized as Parser

Top 7 HTML Parser Projects

  • Crafting Interpreters

    Repository for the book "Crafting Interpreters"

  • Project mention: Crafting Interpreters | news.ycombinator.com | 2023-12-26

  • datefinder

    Find dates inside text using Python and get back datetime objects

  • Project mention: Sneller Regex vs Ripgrep | news.ycombinator.com | 2023-05-18

    That's with DFA minimization. Also, '\w' has 311 states while '(?-u)\w' has 5 states.

    I don't have a precise definition of enormous or impractical. Does it matter? I suppose one obvious one is when DFA construction time starts having a significant impact on total search times.

    > Additionally, the results are not the same: the number of matches is not equal to 7882. How could I make `\w` conform to other regex implementations in ripgrep?

    By following UTS#18: https://unicode.org/reports/tr18/#word

    Most regex engines make \w be ASCII-only by default. But most also have a way to opt into Unicode-aware mode. RE2, Go's regexp and ECMAScript are popular regex engines that have no way to change the interpretation of \w.

    > Fair question how regex compilers handle nefarious regexes. Go does not handle NFA with more than 1000 states, and, as you observed, we added some more restrictions when processing the NFA. It can be an interesting academic exercise to find monstrous regexes, but we haven't encountered useful regexes that hit these limits. But I guess you know some...

    It's definitely not academic. People use regexes for lexers. People use big regexes to recognize certain things like email addresses and dates. Here's a real regex used in real software to identify dates in unstructured text for example: https://github.com/akoumjian/datefinder/blob/5376ece0a522c44...

    Otherwise, as I hinted at above, the thing that can make regexes very large very quickly is when you mix Unicode classes with counted repetitions. It doesn't take a lot to make them "big."
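The gap between Unicode-aware and ASCII-only `\w` described in the post above can be sketched with Python's standard `re` module. Python is used here only as an illustration and is not one of the engines discussed in the thread; the sample string is made up. For `str` patterns, Python's `\w` is Unicode-aware by default, and the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`.

```python
import re

# Hypothetical sample text: "naïve café 東京 abc_123", written with escapes
# so the exact code points are unambiguous.
text = "na\u00efve caf\u00e9 \u6771\u4eac abc_123"

# Unicode-aware \w (Python's default for str patterns).
unicode_words = re.findall(r"\w+", text)

# ASCII-only \w, equivalent to [a-zA-Z0-9_], selected with re.ASCII.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # four whole words, including the non-ASCII ones
print(ascii_words)    # ['na', 've', 'caf', 'abc_123']
```

The same pattern yields different matches, and different match counts, depending on which interpretation of `\w` is in effect, which is one reason results can differ between regex implementations.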

  • html-query

    jq, but for HTML

  • traprange

    (Java) A Method to Extract Tabular Content from PDF Files

  • RatS

    Movie Ratings Synchronization with Python

  • jaxon

    Streaming JSON parser for Elixir

  • htoml

    TOML file format parser in Haskell

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Parser projects in HTML? This list will help you:

#  Project                Stars
1  Crafting Interpreters  8,103
2  datefinder             625
3  html-query             605
4  traprange              321
5  RatS                   254
6  jaxon                  193
7  htoml                  38
