lol-html
sccache
Our great sponsors
lol-html | sccache | |
---|---|---|
8 | 70 | |
1,390 | 5,347 | |
1.9% | 3.0% | |
5.7 | 9.4 | |
about 1 month ago | 1 day ago | |
Rust | Rust | |
BSD 3-clause "New" or "Revised" License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
lol-html
-
Ask HN: A fast, Rust HTML parser that works?
So I'm doing some web scraping in Rust, and so I will need to parse HTML. [scraper](https://docs.rs/scraper/latest/scraper/) (which uses [html5ever](https://github.com/servo/html5ever)) is doing fine except that it's the bottleneck of my application.
So I need a faster parser. I've tried [tl](https://docs.rs/tl/latest/tl/) which would've been perfect except that it doesn't actually work on the HTML I have. When I try to `query_selector` the elements I need, it returns nothing.
[Kuchiki](https://docs.rs/kuchiki/latest/kuchiki/) is abandonded.
I couldn't figure out how to get [lol-html](https://github.com/cloudflare/lol-html) to work for me (it's designed for re-writing HTML, whatever that means). It doesn't seem to have an API to extract the inner text of an element.
[html5gum](https://github.com/untitaker/html5gum) seems to be just an HTML tokenizer, or otherwise just too low-level. I have not yet tried [quick-xml](https://github.com/tafia/quick-xml/) but judging from the README, it's pretty low-level too. I mean, if these are the only options left then I will try them. Otherwise, I would love to use a parser that's faster but as ergonomic as `scraper` or `tl`.
At this point, I would be happy with an Lxml bridge/port of some sort. I don't need to mutate HTML, just parse and read data from it.
-
How much Rust work is actually going on at Cloudflare?
I'm also in the Workers org but I have had a bit of interaction with Rust. There's some Rust in the Workers runtime using lol-html for HTMLRewriter as well as some tooling and there's the full blown workers-rs framework that I work on, but that's about it for the Rust I work on regularly.
- Is there a library for manipulating HTML?
- pup: Parsing HTML at the Command Line
-
Texting Robots: Taming robots.txt with Rust and 34 million tests
Thanks again and happy to answer any questions! My current unreleased Rust projects include a web crawler that uses Tokio + Tokio Console + Reqwest with this crate for robots.txt and a fast text extraction library using lol-html that I am planning to sprinkle with some minimal ML to get Readability.js style intelligent extraction (with training in Python). See Fathom for an example of the ML approach I'll likely take.
-
Like JQ, but for HTML
I’d like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.
[0] https://github.com/cloudflare/lol-html
- Things you can’t do in Rust (and what to do instead)
-
Problems with building a backend app in Rust in 2020
Cloudflare has open sourced lol-html, a "Low output latency streaming HTML parser/rewriter with CSS selector-based API". Is that what you are looking for?
sccache
-
Mozilla sccache: cache with cloud storage
Worth noting that the first commit in sccache git repository was in 2014 (https://github.com/mozilla/sccache/commit/115016e0a83b290dc2...). So I suppose that what "happened" happened waay back.
- Welcome to Apache OpenDAL
-
Target file are very huge and running out of storage on mac.
If you have lots of shared dependencies, maybe try sccache?
-
S3 Express Is All You Need
I'm going to set up sccache [0] to use it tomorrow. We use MSVC, so EFS is off the cards.
[0] https://github.com/mozilla/sccache/blob/main/docs/S3.md
- sccache
-
Serde has started shipping precompiled binaries with no way to opt out
I think the primary benefit of pre-built procmacros will be for build servers which don't use a persistent cache (like sccache), since they have to compile all dependencies every time. But IMO improved support for persistent caches would be a better investment compared to adding support for pre-built procmacros.
-
Cache dependencies across crates
Checkout https://github.com/mozilla/sccache
-
Distcc: A fast, free distributed C/C++ compiler
https://github.com/mozilla/sccache is another option which addresses the use cases of both icecream and ccache (and also supports Rust, and cloud storage of artifacts, if those are useful for you)
-
How to fix Rust Coding LARGE files????
That being said a compilation cache, eg the de-facto standard for Rust: sccache (https://github.com/mozilla/sccache) will help to compile and store some of the build artifacts centralized - still for each crate version + build profile (RUSTFLAGS) combination.
-
On the verge of giving up learning Haskell because of the terrible tooling.
That's definitely not my experience. Never had any issue running Rust on Windows. You just download and run rustup-init.exe, then updating is simply a matter of rustup update. Documentation generation is built in (cargo doc) and just a case of annotating code with triple-/ markdown comments and then running that command. sccache works fine for me (just need to set RUSTC_WRAPPER=/path/to/sccache). And the error messages from rustc are by far the best of any compiler I've used. Not sure how they're unhelpful, they tend to explain step-by-step what the problem is and how to fix it.
What are some alternatives?
actor-rust-scraper - Experimental scraper in Rust suited for running locally or on the Apify platform. Inspired by Apify SDK.
ccache - ccache – a fast compiler cache
tq - Perform a lookup by CSS selector on an HTML input
cargo-chef - A cargo-subcommand to speed up Rust Docker builds using Docker layer caching.
yq - Command-line YAML, XML, TOML processor - jq wrapper for YAML/XML/TOML documents
rust-cache - A GitHub Action that implements smart caching for rust/cargo projects
tools - all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff
cache - Cache dependencies and build outputs in GitHub Actions
hq - lightweight command line HTML processor using CSS and XPath selectors
icecream - Distributed compiler with a central scheduler to share build load
cargo-expand - Subcommand to show result of macro expansion
mold - Mold: A Modern Linker 🦠