I think we've all (mostly?) tried it. It really is the Wild West of the web when you're trying to parse other people's HTML, though.
I've played around with this parser which is extremely quick. https://github.com/lexbor/lexbor
-
oil
Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!
All you need to parse HTML is regular expressions (to recognize tags) and a stack (to match tags).
Your programming language has a stack -- a call stack.
So in practice all you really need is regular expressions. (Which I tend to call regular languages to make a distinction with Perl-style regexes [1])
Using the call stack in a more functional style is nicer than using the OOP style that's in the Python standard library, which is probably inherited from Java, etc.
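To make the regex + call stack idea concrete, here is a minimal sketch in Python. It is not taken from the doctools code mentioned in this comment; the names `TOKEN_RE`, `parse_children`, and `parse` are made up for illustration, and it assumes well-formed input with no comments, CDATA, or void elements like `<br>`. A regex recognizes tags and text, and recursion plays the role of the stack that matches them.

```python
import re

# One regex recognizes start tags, end tags, and text runs.
# Simplification (assumption): no comments, CDATA, or void elements.
TOKEN_RE = re.compile(
    r'<(?P<end>/)?(?P<tag>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|(?P<text>[^<]+)')


def parse_children(tokens, pos, open_tag=None):
    """Collect nodes until the end tag that closes open_tag.

    Returns (children, new_pos). Each recursive call corresponds to one
    open element, so the call stack is the tag-matching stack.
    """
    children = []
    while pos < len(tokens):
        m = tokens[pos]
        pos += 1
        if m.group('text') is not None:
            children.append(m.group('text'))
        elif m.group('end'):
            if m.group('tag') != open_tag:
                raise ValueError('unexpected </%s>' % m.group('tag'))
            return children, pos  # pop: return to the enclosing element
        else:
            tag = m.group('tag')
            kids, pos = parse_children(tokens, pos, tag)  # push: recurse
            children.append((tag, kids))
    if open_tag is not None:
        raise ValueError('unclosed <%s>' % open_tag)
    return children, pos


def parse(html):
    """Hypothetical entry point: returns a nested list of (tag, children)."""
    tokens = list(TOKEN_RE.finditer(html))
    tree, _ = parse_children(tokens, 0)
    return tree
```

Calling `parse('<ul><li>a</li></ul>')` gives a nested structure of `(tag, children)` tuples, and a mismatched end tag raises `ValueError`. A real processor would likely act on the tokens directly instead of building a tree; the tree here only makes the nesting visible.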
I have done this with a bunch of HTML processors for the Oil blog and doc toolchain:
https://github.com/oilshell/oil/tree/master/doctools
It works well in practice and is correct and fast. I meant to write a blog post titled "why you can parse HTML with regexes" about this but didn't get around to it.
There is a nit where parsing arbitrary name=value pairs with regexes isn't ergonomic, because it's hard to capture a variable number of pairs. However the point is that I wrote 5 or 6 useful and compact HTML processors that don't need that. In practice when you parse HTML you often have a "fixed" schema.
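The name=value nit can be illustrated with a short Python sketch. A repeated group in a regex keeps only its last repetition, so one common workaround (an assumption here, not necessarily what the author's processors do) is to capture the whole attribute run with one pattern and then scan it with a second. The names `START_TAG_RE`, `ATTR_RE`, and `parse_start_tag` are hypothetical, and only double-quoted attribute values are handled.

```python
import re

# Group 2 captures the entire run of attributes as one string, because a
# repeated inner group would only remember its final repetition.
START_TAG_RE = re.compile(
    r'<([a-zA-Z][a-zA-Z0-9]*)((?:\s+[a-zA-Z-]+="[^"]*")*)\s*>')

# Second pass: pull each name="value" pair out of that run.
ATTR_RE = re.compile(r'([a-zA-Z-]+)="([^"]*)"')


def parse_start_tag(s):
    """Return (tag_name, attrs_dict), or None if s does not begin with a
    start tag in the simplified form this sketch supports."""
    m = START_TAG_RE.match(s)
    if m is None:
        return None
    return m.group(1), dict(ATTR_RE.findall(m.group(2)))
```

With a fixed schema this second pass is unnecessary: if a processor only ever needs, say, the `href` of a link, a single pattern with one named group is enough.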
Concrete examples are generating a TOC from heading tags and syntax highlighting code blocks. Those all work great with the regex + call stack style.
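As a rough illustration of the "fixed schema" case, here is a sketch of a TOC generator in Python. It assumes the headings of interest are `<h2>` and `<h3>` elements that each carry an `id` attribute as their only attribute; that schema, the Markdown-style output, and the names `HEADING_RE` and `make_toc` are assumptions for the example, not details from the original comment.

```python
import re

# Fixed schema (assumption): <h2 id="..."> or <h3 id="...">, nothing else.
# \1 forces the closing tag to use the same level as the opening tag.
HEADING_RE = re.compile(r'<h([23])\s+id="([^"]+)"\s*>(.*?)</h\1>', re.DOTALL)

# Used to drop inline markup such as <code> from heading titles.
TAG_RE = re.compile(r'<[^>]+>')


def make_toc(html):
    """Return a nested Markdown-style list linking to each heading."""
    lines = []
    for level, anchor, title in HEADING_RE.findall(html):
        indent = '  ' * (int(level) - 2)  # h2 at top level, h3 indented once
        text = TAG_RE.sub('', title).strip()
        lines.append('%s- [%s](#%s)' % (indent, text, anchor))
    return '\n'.join(lines)
```

Because the schema is fixed, no general attribute parsing is needed: headings that do not fit the pattern are simply skipped.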