lol-html
gron
 | lol-html | gron
---|---|---
Mentions | 8 | 64
Stars | 1,390 | 13,483
Growth | 1.9% | -
Activity | 5.7 | 0.0
Last commit | about 1 month ago | 6 months ago
Language | Rust | Go
License | BSD 3-clause "New" or "Revised" License | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
lol-html
- Ask HN: A fast, Rust HTML parser that works?
So I'm doing some web scraping in Rust, and so I will need to parse HTML. [scraper](https://docs.rs/scraper/latest/scraper/) (which uses [html5ever](https://github.com/servo/html5ever)) is doing fine except that it's the bottleneck of my application.
So I need a faster parser. I've tried [tl](https://docs.rs/tl/latest/tl/) which would've been perfect except that it doesn't actually work on the HTML I have. When I try to `query_selector` the elements I need, it returns nothing.
[Kuchiki](https://docs.rs/kuchiki/latest/kuchiki/) is abandoned.
I couldn't figure out how to get [lol-html](https://github.com/cloudflare/lol-html) to work for me (it's designed for re-writing HTML, whatever that means). It doesn't seem to have an API to extract the inner text of an element.
[html5gum](https://github.com/untitaker/html5gum) seems to be just an HTML tokenizer, or otherwise just too low-level. I have not yet tried [quick-xml](https://github.com/tafia/quick-xml/) but judging from the README, it's pretty low-level too. I mean, if these are the only options left then I will try them. Otherwise, I would love to use a parser that's faster but as ergonomic as `scraper` or `tl`.
At this point, I would be happy with an Lxml bridge/port of some sort. I don't need to mutate HTML, just parse and read data from it.
- How much Rust work is actually going on at Cloudflare?
I'm also in the Workers org but I have had a bit of interaction with Rust. There's some Rust in the Workers runtime using lol-html for HTMLRewriter as well as some tooling and there's the full blown workers-rs framework that I work on, but that's about it for the Rust I work on regularly.
- Is there a library for manipulating HTML?
- pup: Parsing HTML at the Command Line
- Texting Robots: Taming robots.txt with Rust and 34 million tests
Thanks again and happy to answer any questions! My current unreleased Rust projects include a web crawler that uses Tokio + Tokio Console + Reqwest with this crate for robots.txt and a fast text extraction library using lol-html that I am planning to sprinkle with some minimal ML to get Readability.js style intelligent extraction (with training in Python). See Fathom for an example of the ML approach I'll likely take.
- Like JQ, but for HTML
I’d like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.
[0] https://github.com/cloudflare/lol-html
- Things you can’t do in Rust (and what to do instead)
- Problems with building a backend app in Rust in 2020
Cloudflare has open sourced lol-html, a "Low output latency streaming HTML parser/rewriter with CSS selector-based API". Is that what you are looking for?
gron
- Frawk: An efficient Awk-like programming language. (2021)
gron (https://github.com/tomnomnom/gron) to transform it and query and then invert the transformation?
- Show HN: Flatito, grep for YAML and JSON files
- Gron: Make JSON greppable
- Make JSON Greppable
It buffers all of its output statements in memory before writing to stdout:
https://github.com/tomnomnom/gron/blob/master/main.go#L204
- Ask HN: What are some unpopular technologies you wish people knew more about?
- Jaq – A jq clone focused on correctness, speed, and simplicity
Have you tried `gron`?
It converts your nested json into a line by line format which plays better with tools like `grep`
From the project's README:
▶ gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page..." | fgrep "commit.author"
json[0].commit.author = {};
json[0].commit.author.date = "2016-07-02T10:51:21Z";
json[0].commit.author.email = "[email protected]";
json[0].commit.author.name = "Tom Hudson";
https://github.com/tomnomnom/gron
It was suggested to me in HN comments on an article I wrote about `jq`, and I have found myself using it a lot in my day to day workflow
- Interactive Examples for Learning Jq
> So all I want is a tool to go from json => line oriented and I will do the rest with the vast library of experience I already have at transformations on the command line.
The tool for that is likely https://github.com/tomnomnom/gron
- Modern Linux Tools vs. Unix Classics: Which Would I Choose?
If JQ is too much, see GRON &| Miller
gron transforms JSON into discrete assignments to make it easier to grep for what you want https://github.com/tomnomnom/gron
Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, and JSON https://github.com/johnkerl/miller
- XML is better than YAML
- jq 1.7 Released
And jless [1] and gron [2].
This is the first I'm hearing of gron, but adding here for completeness' sake. Meanwhile, JSON seems to be becoming a standard for CLI tools. Ideal scenario would be if every CLI tool has a --json flag or something similar, so that jc is not needed anymore.
[1] https://jless.io/
[2] https://github.com/tomnomnom/gron
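Several of the comments above describe the same idea: flatten JSON into one assignment per line, filter those lines with ordinary text tools, then turn what is left back into JSON. The following is a minimal Python sketch of that round trip, not gron's actual Go implementation. The function names `flatten` and `unflatten`, the sample document, and the `sha` value are hypothetical. Sorting object keys is an assumption made to mirror the ordering in the README excerpt quoted above.

```python
import json
import re

# Keys that look like plain identifiers become .key; anything else becomes ["key"].
_IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
# One path step: .name, [123], or ["quoted key"].
_STEP = re.compile(r'\.([A-Za-z_$][A-Za-z0-9_$]*)|\[(\d+)\]|\[("(?:[^"\\]|\\.)*")\]')


def flatten(value, path="json"):
    """Yield one 'path = value;' line per node, parents before children."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key in sorted(value):  # assumption: sorted keys, as in the README excerpt
            step = f".{key}" if _IDENT.fullmatch(key) else f"[{json.dumps(key)}]"
            yield from flatten(value[key], path + step)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield f"{path} = {json.dumps(value)};"


def _parse_path(path):
    """Split 'json[0].a["b c"]' into [0, 'a', 'b c']; the leading 'json' is dropped."""
    if not path.startswith("json"):
        raise ValueError(f"unexpected path: {path!r}")
    steps, pos = [], len("json")
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse path: {path!r}")
        name, index, quoted = match.groups()
        if name is not None:
            steps.append(name)
        elif index is not None:
            steps.append(int(index))
        else:
            steps.append(json.loads(quoted))
        pos = match.end()
    return steps


def _get(node, key):
    if isinstance(node, list):
        return node[key] if key < len(node) else None
    return node.get(key)


def _set(node, key, value):
    if isinstance(node, list):
        node.extend([None] * (key + 1 - len(node)))  # pad missing list slots
    node[key] = value


def unflatten(lines):
    """Rebuild JSON from assignment lines, creating missing parents on the way.

    Assumes parents appear before their children, as flatten() emits them.
    """
    holder = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, _, raw = line.rstrip(";").partition(" = ")
        steps = ["json"] + _parse_path(path)
        node = holder
        for key, next_key in zip(steps, steps[1:]):
            child = _get(node, key)
            if child is None:
                child = [] if isinstance(next_key, int) else {}
                _set(node, key, child)
            node = child
        _set(node, steps[-1], json.loads(raw))
    return holder.get("json")


# Hypothetical sample document, loosely shaped like the GitHub commits response.
doc = json.loads(
    '[{"sha": "abc123", "commit": {"author": '
    '{"name": "Tom Hudson", "date": "2016-07-02T10:51:21Z"}}}]'
)
lines = list(flatten(doc))
kept = [line for line in lines if "commit.author" in line]  # stands in for fgrep
subset = unflatten(kept)
print("\n".join(kept))
print(json.dumps(subset))
```

Because `flatten` is a generator, a caller could write each line as it is produced instead of collecting them all first, which relates to the buffering remark in the "Make JSON Greppable" comment. gron itself provides a `--ungron` mode for the reverse step.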
What are some alternatives?
actor-rust-scraper - Experimental scraper in Rust suited for running locally or on the Apify platform. Inspired by Apify SDK.
jq - Command-line JSON processor [Moved to: https://github.com/jqlang/jq]
tq - Perform a lookup by CSS selector on an HTML input
jfq - JSONata on the command line
yq - Command-line YAML, XML, TOML processor - jq wrapper for YAML/XML/TOML documents
xidel - Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
tools - all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff
pup - Parsing HTML at the command line
hq - lightweight command line HTML processor using CSS and XPath selectors
JsonPath - Java JsonPath implementation
cargo-expand - Subcommand to show result of macro expansion
fx - Terminal JSON viewer & processor