You can't parse [X]HTML with regex

lexbor

10 881 8.5 C

Lexbor is an open source HTML renderer library under development. https://lexbor.com

I think we've all (mostly?) tried it. It really is the Wild West of the web when you're trying to parse other people's HTML, though.
I've played around with this parser, which is extremely quick: https://github.com/lexbor/lexbor

oil

234 2,717 9.9 Python

Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!

All you need to parse HTML is regular expressions (to recognize tags) and a stack (to match tags).
Your programming language has a stack -- a call stack.
So in practice all you really need is regular expressions (which I tend to call regular languages, to distinguish them from Perl-style regexes [1]).
Using the call stack in a more functional style is nicer than using the OOP style that's in the Python standard library, which is probably inherited from Java, etc.
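A minimal sketch of the "regex + call stack" idea described above, written for illustration and not taken from the Oil doctools. The names `TOKEN_RE`, `tokens`, `parse_children`, and `parse` are hypothetical, and the sketch assumes well-nested input with no void elements, comments, or self-closing tags. A regular expression recognizes tags and text, and recursion lets the Python call stack remember which tags are open:

```python
import re

# Hypothetical token pattern: a close tag, an open tag, or a run of text.
TOKEN_RE = re.compile(
    r"</(?P<close>[a-zA-Z][a-zA-Z0-9]*)\s*>"
    r"|<(?P<open>[a-zA-Z][a-zA-Z0-9]*)[^>]*>"
    r"|(?P<text>[^<]+)"
)


def tokens(html):
    """Yield (kind, value) pairs, where kind is 'open', 'close', or 'text'."""
    for m in TOKEN_RE.finditer(html):
        kind = m.lastgroup
        yield kind, m.group(kind)


def parse_children(tok_iter, parent=None):
    """Collect nodes until the close tag for `parent` is seen."""
    children = []
    for kind, value in tok_iter:
        if kind == "text":
            children.append(value)
        elif kind == "open":
            # The recursive call is the "stack": it returns when the
            # matching close tag for `value` has been consumed.
            children.append((value, parse_children(tok_iter, value)))
        else:  # kind == "close"
            if value != parent:
                raise ValueError(f"unexpected </{value}>, expected </{parent}>")
            return children
    if parent is not None:
        raise ValueError(f"missing </{parent}>")
    return children


def parse(html):
    """Parse a well-nested fragment into nested (tag, children) tuples."""
    return parse_children(tokens(html))


if __name__ == "__main__":
    print(parse("<ul><li>a</li><li>b</li></ul>"))
```

No explicit stack object or handler class appears anywhere; each open tag simply becomes one more frame on the call stack, which is the contrast with the callback-based OOP style.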
I have done this with a bunch of HTML processors for the Oil blog and doc toolchain:
https://github.com/oilshell/oil/tree/master/doctools
It works well in practice and is correct and fast. I meant to write a blog post titled "why you can parse HTML with regexes" about this but didn't get around to it.
There is a nit where parsing arbitrary name=value pairs with regexes isn't ergonomic, because it's hard to capture a variable number of pairs. However, the point is that I wrote 5 or 6 useful and compact HTML processors that don't need that. In practice when you parse HTML you often have a "fixed" schema.
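The name=value nit can be illustrated with Python's `re` module. This is a sketch under assumptions: attributes are always double-quoted, and the helper names (`last_pair_only`, `all_attrs`, `fixed_schema_href`) are made up for this example. A repeated capture group keeps only its last match, so arbitrary attributes need a second pass, while a "fixed" schema needs only one targeted pattern:

```python
import re

# A repeated group: Python keeps only the LAST repetition's captures.
REPEATED_RE = re.compile(r'<a(?:\s+([a-zA-Z-]+)="([^"]*)")*\s*>')

# Two-pass alternative: grab the whole attribute span, then scan it.
TAG_RE = re.compile(
    r'<(?P<name>[a-zA-Z][a-zA-Z0-9]*)'
    r'(?P<attrs>(?:\s+[a-zA-Z-]+="[^"]*")*)\s*>'
)
ATTR_RE = re.compile(r'([a-zA-Z-]+)="([^"]*)"')


def last_pair_only(tag):
    """Show the limitation: earlier name=value pairs are lost."""
    m = REPEATED_RE.match(tag)
    return m.groups() if m else None


def all_attrs(tag):
    """Arbitrary attributes: match the tag, then findall over the span."""
    m = TAG_RE.match(tag)
    if m is None:
        return None
    return dict(ATTR_RE.findall(m.group("attrs")))


def fixed_schema_href(tag):
    """Fixed schema: only one known attribute matters, so one regex is enough."""
    m = re.search(r'\shref="([^"]*)"', tag)
    return m.group(1) if m else None


if __name__ == "__main__":
    tag = '<a href="/x" class="y">'
    print(last_pair_only(tag))     # ('class', 'y')
    print(all_attrs(tag))          # {'href': '/x', 'class': 'y'}
    print(fixed_schema_href(tag))  # /x
```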
Concrete examples are generating a TOC from heading tags and syntax highlighting code blocks. Those all work great with the regex + call stack style.
[1] http://www.oilshell.org/blog/2020/07/eggex-theory.html
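As a rough illustration of the TOC use case, here is a sketch that assumes the headings of interest are `<h2>` and `<h3>` (the exact tags were lost from the text above, so this choice is an assumption). The names `HEADING_RE`, `make_toc`, and `render_toc` are hypothetical, and this is a flat regex pass over a fixed schema, not the Oil doctools code:

```python
import re

# Assumed schema: only <h2> and <h3> headings feed the table of contents.
HEADING_RE = re.compile(
    r"<h([23])[^>]*>(.*?)</h\1\s*>", re.DOTALL | re.IGNORECASE
)
INNER_TAG_RE = re.compile(r"<[^>]+>")


def make_toc(html):
    """Return a list of (level, title) pairs in document order."""
    entries = []
    for m in HEADING_RE.finditer(html):
        level = int(m.group(1))
        # Drop inline markup such as <code> inside the heading text.
        title = INNER_TAG_RE.sub("", m.group(2)).strip()
        entries.append((level, title))
    return entries


def render_toc(entries):
    """Render entries as an indented plain-text outline."""
    lines = []
    for level, title in entries:
        indent = "  " * (level - 2)
        lines.append(f"{indent}- {title}")
    return "\n".join(lines)


if __name__ == "__main__":
    doc = "<h2>Intro</h2><p>x</p><h3 id='a'>Part <code>A</code></h3>"
    print(render_toc(make_toc(doc)))
```

Because the schema is fixed, the pattern never has to deal with arbitrary attributes or nesting; it only needs to find the few tags the tool cares about.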

CPython

1,310 59,431 10.0 Python

The Python programming language

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Modest: A fast HTML renderer implemented as a pure C99 library
2 projects | news.ycombinator.com | 12 Jan 2024
Created a performance-focused HTML5 parser for Ruby, trying to be API-compatible with Nokogiri
2 projects | /r/ruby | 20 Dec 2022
Lexbor: Fast HTML Renderer library in C
1 project | news.ycombinator.com | 30 Jul 2022
Lexbor: Open-source HTML Renderer library in C
1 project | news.ycombinator.com | 22 Apr 2022
What second language to learn after Python?
3 projects | /r/learnprogramming | 14 May 2021

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
C HTML Fast
Post date: 5 Mar 2021
