html5gum
parse5
| | html5gum | parse5 |
|---|---|---|
| Mentions | 3 | 4 |
| Stars | 146 | 3,569 |
| Growth | - | - |
| Activity | 6.8 | 9.1 |
| Last commit | about 2 months ago | 1 day ago |
| Language | Rust | TypeScript |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
html5gum
-
Ask HN: A fast, Rust HTML parser that works?
So I'm doing some web scraping in Rust, and so I will need to parse HTML. [scraper](https://docs.rs/scraper/latest/scraper/) (which uses [html5ever](https://github.com/servo/html5ever)) is doing fine except that it's the bottleneck of my application.
So I need a faster parser. I've tried [tl](https://docs.rs/tl/latest/tl/) which would've been perfect except that it doesn't actually work on the HTML I have. When I try to `query_selector` the elements I need, it returns nothing.
[Kuchiki](https://docs.rs/kuchiki/latest/kuchiki/) is abandoned.
I couldn't figure out how to get [lol-html](https://github.com/cloudflare/lol-html) to work for me (it's designed for re-writing HTML, whatever that means). It doesn't seem to have an API to extract the inner text of an element.
[html5gum](https://github.com/untitaker/html5gum) seems to be just an HTML tokenizer, or otherwise just too low-level. I have not yet tried [quick-xml](https://github.com/tafia/quick-xml/) but judging from the README, it's pretty low-level too. I mean, if these are the only options left then I will try them. Otherwise, I would love to use a parser that's faster but as ergonomic as `scraper` or `tl`.
At this point, I would be happy with an Lxml bridge/port of some sort. I don't need to mutate HTML, just parse and read data from it.
- html5gum: A WHATWG-compliant HTML5 tokenizer and tag soup parser
parse5
-
error of installing icon library
131 packages are looking for funding; run `npm fund` for details.
72 vulnerabilities (12 low, 19 moderate, 37 high, 4 critical). To address issues that do not require attention, run `npm audit fix`. To address all issues (including breaking changes), run `npm audit fix --force`. Run `npm audit` for details.
C:\Users\39388\Desktop\VALU PROCESS\FRONT\ConsultingBag_Frontend-main\ConsultingBag_Frontend-main>npm fund
├─┬ https://opencollective.com/bootstrap
│ └── https://opencollective.com/popperjs (@popperjs package)
├── https://opencollective.com/date-fns
├── https://opencollective.com/formik
├── https://opencollective.com/styled-components
├── https://github.com/sponsors/jacobwgillespie (@styled-icons packages)
├─┬ https://github.com/chalk/chalk?sponsor=1
│ └── https://github.com/chalk/ansi-styles?sponsor=1
├── https://github.com/sponsors/RubenVerborgh
├── https://github.com/chalk/wrap-ansi?sponsor=1
├── https://opencollective.com/core-js
├─┬ https://opencollective.com/babel (@babel package)
│ └── https://opencollective.com/browserslist
├── https://github.com/sponsors/ljharb
├─┬ https://github.com/inikulin/parse5?sponsor=1
│ └── https://github.com/fb55/entities?sponsor=1
├── https://github.com/sponsors/fb55
├── https://github.com/sponsors/sindresorhus
├── https://github.com/sponsors/epoberezkin
├── https://github.com/sponsors/isaacs
├── https://github.com/fb55/htmlparser2?sponsor=1
├── https://opencollective.com/postcss/
├── https://github.com/sponsors/wooorm
├── https://tidelift.com/funding/github/npm/autoprefixer
├── https://github.com/sponsors/feross
├─┬ https://paulmillr.com/funding/
│ └── https://github.com/sponsors/jonschlinkert
└── https://tidelift.com/funding/github/npm/loglevel
-
casperjs, phantomjs, what is not going to be abandonware?
A relatively stable option would probably be to just use puppeteer directly to spawn a headless chrome, and extract the html that way. If you want to parse the html, I recommend feeding that into parse5.
-
Getting Started with Deno
After some googling, I landed on parse5 which appeared to have wide usage and offered a simple, low-level tree API at its core.
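To give a feel for what a "simple, low-level tree API" means in practice, here is a minimal TypeScript sketch of walking a tree shaped like the output of parse5's default tree adapter (elements with `nodeName` and `childNodes`, text nodes named `#text` carrying a `value`). The interfaces are a simplified assumption rather than parse5's full types, the helpers `innerText` and `findAll` are hypothetical names, and the tree is built by hand so the snippet runs without installing anything. With parse5 installed, a tree of this shape would come from `parse(html)` or `parseFragment(html)` instead.

```typescript
// Simplified node shapes modelled on parse5's default tree adapter.
// Assumption: only the fields used below; real parse5 nodes carry more.
interface TextNode {
  nodeName: "#text";
  value: string;
}

interface ElementNode {
  nodeName: string;
  attrs: { name: string; value: string }[];
  childNodes: TreeNode[];
}

type TreeNode = TextNode | ElementNode;

function isText(node: TreeNode): node is TextNode {
  return node.nodeName === "#text";
}

// Hypothetical helper: concatenate the text of a node and all its descendants.
function innerText(node: TreeNode): string {
  if (isText(node)) return node.value;
  return node.childNodes.map(innerText).join("");
}

// Hypothetical helper: depth-first search for elements with a given tag name.
function findAll(node: TreeNode, tag: string): ElementNode[] {
  if (isText(node)) return [];
  const found: ElementNode[] = [];
  if (node.nodeName === tag) found.push(node);
  for (const child of node.childNodes) {
    found.push(...findAll(child, tag));
  }
  return found;
}

// Hand-built stand-in for <ul><li>one</li><li>two <b>bold</b></li></ul>.
const el = (nodeName: string, ...childNodes: TreeNode[]): ElementNode => ({
  nodeName,
  attrs: [],
  childNodes,
});
const text = (value: string): TextNode => ({ nodeName: "#text", value });

const tree = el(
  "ul",
  el("li", text("one")),
  el("li", text("two "), el("b", text("bold")))
);

const items = findAll(tree, "li").map(innerText);
console.log(items); // [ 'one', 'two bold' ]
```

The point of the sketch is that extracting data from such a tree needs nothing more than plain recursion over `childNodes`; selector engines and DOM-like conveniences are layered on top by other libraries.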
-
How does session replay work Part1: Serialization
We do not use existing open-source solutions such as parse5 for two reasons:
What are some alternatives?
sax-wasm - The first streamable, fixed memory XML, HTML, and JSX parser for WebAssembly.
JSONStream
germ - 🦠 The Definitive Gemini Protocol Toolkit
URI.js - Javascript URL mutation library
html5ever - High-performance browser-grade HTML5 parser
xml2js - XML to JavaScript object converter.
quick-xml - Rust high performance xml reader and writer
nearley - 📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
Fuzi - A fast & lightweight XML & HTML parser in Swift with XPath & CSS support
PEG.js - PEG.js: Parser generator for JavaScript
logos - Create ridiculously fast Lexers
json-query - Retrieves values from JSON objects for data binding