lol-html vs readability

| Project | lol-html | readability |
|---|---|---|
| Mentions | 8 | 51 |
| Stars | 1,390 | 8,056 |
| Growth | 1.9% | 7.4% |
| Activity | 5.7 | 6.3 |
| Latest Commit | about 1 month ago | 5 days ago |
| Language | Rust | JavaScript |
| License | BSD 3-clause "New" or "Revised" License | GNU General Public License v3.0 or later |

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
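
The metric definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: the site does not publish its formulas, so the function names, the exponential recency weighting, and the 30-day half-life are all hypothetical. Only the month-over-month percentage and the "9.0 means top 10%" reading come from the text.

```javascript
// Month-over-month growth in stars, as a percentage.
function monthlyGrowth(starsNow, starsMonthAgo) {
  if (starsMonthAgo <= 0) return null; // growth is undefined without a baseline
  return ((starsNow - starsMonthAgo) / starsMonthAgo) * 100;
}

// ASSUMPTION: recent commits weigh more than older ones via exponential decay.
// Each commit contributes 0.5^(ageDays / halfLifeDays); the half-life is made up.
function rawActivity(commitAgesInDays, halfLifeDays = 30) {
  return commitAgesInDays.reduce(
    (sum, age) => sum + Math.pow(0.5, age / halfLifeDays),
    0
  );
}

// ASSUMPTION: the 0-10 score is a percentile rank among all tracked projects,
// so a score of 9.0 means the project ranks above 90% of them (the top 10%).
function activityScore(raw, allRawValues) {
  const below = allRawValues.filter((v) => v < raw).length;
  return (below / allRawValues.length) * 10;
}
```

Under this reading, a project whose raw value exceeds that of nine out of ten tracked projects would score about 9.0, which matches the example given in the definition.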

lol-html

Posts with mentions or reviews of lol-html. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-02-23.

Ask HN: A fast, Rust HTML parser that works?
4 projects | news.ycombinator.com | 23 Feb 2023

So I'm doing some web scraping in Rust, and so I will need to parse HTML. [scraper](https://docs.rs/scraper/latest/scraper/) (which uses [html5ever](https://github.com/servo/html5ever)) is doing fine except that it's the bottleneck of my application.
So I need a faster parser. I've tried [tl](https://docs.rs/tl/latest/tl/) which would've been perfect except that it doesn't actually work on the HTML I have. When I try to `query_selector` the elements I need, it returns nothing.
[Kuchiki](https://docs.rs/kuchiki/latest/kuchiki/) is abandoned.
I couldn't figure out how to get [lol-html](https://github.com/cloudflare/lol-html) to work for me (it's designed for re-writing HTML, whatever that means). It doesn't seem to have an API to extract the inner text of an element.
[html5gum](https://github.com/untitaker/html5gum) seems to be just an HTML tokenizer, or otherwise just too low-level. I have not yet tried [quick-xml](https://github.com/tafia/quick-xml/) but judging from the README, it's pretty low-level too. I mean, if these are the only options left then I will try them. Otherwise, I would love to use a parser that's faster but as ergonomic as `scraper` or `tl`.
At this point, I would be happy with an Lxml bridge/port of some sort. I don't need to mutate HTML, just parse and read data from it.
How much Rust work is actually going on at Cloudflare?
2 projects | /r/rust | 15 Jan 2023

I'm also in the Workers org but I have had a bit of interaction with Rust. There's some Rust in the Workers runtime using lol-html for HTMLRewriter as well as some tooling and there's the full blown workers-rs framework that I work on, but that's about it for the Rust I work on regularly.
Is there a library for manipulating HTML?
3 projects | /r/rust | 17 Dec 2022
pup: Parsing HTML at the Command Line
7 projects | news.ycombinator.com | 30 Nov 2022
Texting Robots: Taming robots.txt with Rust and 34 million tests
4 projects | /r/rust | 28 Mar 2022

Thanks again and happy to answer any questions! My current unreleased Rust projects include a web crawler that uses Tokio + Tokio Console + Reqwest with this crate for robots.txt and a fast text extraction library using lol-html that I am planning to sprinkle with some minimal ML to get Readability.js style intelligent extraction (with training in Python). See Fathom for an example of the ML approach I'll likely take.
Like JQ, but for HTML
21 projects | news.ycombinator.com | 7 Sep 2021

I’d like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.
[0] https://github.com/cloudflare/lol-html
Things you can’t do in Rust (and what to do instead)
6 projects | news.ycombinator.com | 15 May 2021
Problems with building a backend app in Rust in 2020
2 projects | /r/rust | 21 Dec 2020

Cloudflare has open sourced lol-html, a "Low output latency streaming HTML parser/rewriter with CSS selector-based API". Is that what you are looking for?

readability

Posts with mentions or reviews of readability. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-02-25.

Mozilla: Readability.js
8 projects | news.ycombinator.com | 25 Feb 2024
CSS for readability
3 projects | /r/webdev | 9 Dec 2023

I'm working with Mozilla's Readability library https://github.com/mozilla/readability to get the "readable" text from articles, and now I want to style the extracted text in a readable way.
Building a Serverless Reader View with Lambda and Chrome
5 projects | dev.to | 25 Sep 2023

Do you remember the Firefox Reader View? It's a feature that removes all unnecessary components like buttons, menus, images, and so on, from a website, focusing on the readable content of the page. The library powering this feature is called Readability.js, which is open source.
Webrecorder: Capture interactive websites and replay them at a later time
6 projects | news.ycombinator.com | 1 Aug 2023

I wonder if Firefox "reader mode as a utility" might be a viable alternative for Pinboard like "content oriented" archiving?
https://github.com/mozilla/readability
Creating an advanced search engine with PostgreSQL
9 projects | news.ycombinator.com | 12 Jul 2023

Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.
https://github.com/mozilla/readability
Btw, Readability is also available in a few other languages, like Kotlin:
https://github.com/dankito/Readability4J
Seeking a tool or method to convert webpages into Q&A format using NLP
1 project | /r/LanguageTechnology | 10 Jun 2023

Use Mozilla's Readability to extract that sweet, sweet text content from webpages.
I built a free prompt managing tool - Knit
2 projects | /r/ChatGPTPromptGenius | 8 Jun 2023

Same as above but the ability to grab the entire article text (you can use the Readability library for that: https://github.com/mozilla/readability)
I need automatic source URLs when I paste any text onto a card or note, like on OneNote.
4 projects | /r/ObsidianMD | 20 Apr 2023

// Original script
// https://gist.github.com/kepano/90c05f162c37cf730abb8ff027987ca3
// Bookmarklet Converter
// https://caiorss.github.io/bookmarklet-maker/
// Libraries
// https://github.com/mixmark-io/turndown
// https://github.com/mozilla/readability
javascript: Promise.all([
  import('https://unpkg.com/[email protected]?module'),
  import('https://unpkg.com/@tehshrike/[email protected]'),
]).then(async ([{ default: Turndown }, { default: Readability }]) => {

  /* Optional vault name */
  const vault = "";

  /* Optional folder name such as "Clippings/" */
  const folder = "Clippings/";

  /* Optional tags */
  const tags = "";

  function getSelectionHtml() {
    var html = "";
    if (typeof window.getSelection != "undefined") {
      var sel = window.getSelection();
      if (sel.rangeCount) {
        var container = document.createElement("div");
        for (var i = 0, len = sel.rangeCount; i < len; ++i) {
          container.appendChild(sel.getRangeAt(i).cloneContents());
        }
        html = container.innerHTML;
      }
    } else if (typeof document.selection != "undefined") {
      if (document.selection.type == "Text") {
        html = document.selection.createRange().htmlText;
      }
    }
    return html;
  }

  const selection = getSelectionHtml();

  const { title, byline, content } = new Readability(document.cloneNode(true)).parse();

  function getFileName(fileName) {
    var userAgent = window.navigator.userAgent,
        platform = window.navigator.platform,
        windowsPlatforms = ['Win32', 'Win64', 'Windows', 'WinCE'];
    if (windowsPlatforms.indexOf(platform) !== -1) {
      fileName = fileName.replace(':', '').replace(/[/\\?%*|"<>]/g, '-');
    } else {
      fileName = fileName.replace(':', '').replace(/\//g, '-').replace(/\\/g, '-');
    }
    return fileName;
  }

  const fileName = getFileName(title);

  if (selection) {
    var markdownify = selection;
  } else {
    var markdownify = content;
  }

  if (vault) {
    var vaultName = '&vault=' + encodeURIComponent(`${vault}`);
  } else {
    var vaultName = '';
  }

  const markdownBody = new Turndown({
    headingStyle: 'atx',
    hr: '---',
    bulletListMarker: '-',
    codeBlockStyle: 'fenced',
    emDelimiter: '*',
  }).turndown(markdownify);

  var date = new Date();

  function convertDate(date) {
    var yyyy = date.getFullYear().toString();
    var mm = (date.getMonth()+1).toString();
    var dd = date.getDate().toString();
    var mmChars = mm.split('');
    var ddChars = dd.split('');
    return yyyy + '-' + (mmChars[1]?mm:"0"+mmChars[0]) + '-' + (ddChars[1]?dd:"0"+ddChars[0]);
  }

  const today = convertDate(date);

  // This is the output template
  // It is similar to an Obsidian core template
  // except to insert a value we use: ${value} instead of {{value}}
  const fileContent =`---
type: clipping
date_added: ${today}
aliases: []
tags: [${tags}]
---
author:: ${byline.toString().split('\n')[0].trim()}
source:: [${title}](${document.URL})

${markdownBody}
`;

  // This copies your text to the clipboard
  navigator.clipboard.writeText(fileContent);

  // This creates a new document in Obsidian containing your clipping
  // I commented it out as this isn't what you asked for
  /*
  document.location.href = "obsidian://new?"
    + "file=" + encodeURIComponent(folder + fileName)
    + "&content=" + encodeURIComponent(fileContent)
    + vaultName;
  */
})
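
The two helper functions in the bookmarklet above can also be tried outside the browser. Below is a minimal sketch of the same logic, not the script author's code: `isWindows` is a hypothetical parameter standing in for the `window.navigator.platform` check, and `padStart` replaces the manual zero-padding.

```javascript
// Format a Date as YYYY-MM-DD in local time, like convertDate in the bookmarklet.
function convertDate(date) {
  const yyyy = String(date.getFullYear());
  const mm = String(date.getMonth() + 1).padStart(2, "0"); // getMonth() is zero-based
  const dd = String(date.getDate()).padStart(2, "0");
  return `${yyyy}-${mm}-${dd}`;
}

// Sanitize a page title for use as a file name.
// `isWindows` is a hypothetical parameter replacing the platform sniffing.
function getFileName(fileName, isWindows) {
  // Removes only the first colon, which mirrors the original's behavior.
  const withoutColon = fileName.replace(":", "");
  return isWindows
    ? withoutColon.replace(/[/\\?%*|"<>]/g, "-") // characters Windows rejects
    : withoutColon.replace(/[/\\]/g, "-"); // path separators only
}
```

Passing the platform in as an argument keeps the helpers free of browser globals, so they can be checked in Node before being pasted back into a bookmarklet.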
Any js packages to only scrape relevant content from a webpage?
1 project | /r/webscraping | 27 Mar 2023
RSS meets GPT-3
2 projects | /r/rss | 18 Feb 2023

So the first part of the task is to "extract the text from URL", and that is achieved by using a descendant of the https://github.com/mozilla/readability library, which can extract the text of any URL.

What are some alternatives?

When comparing lol-html and readability you can also consider the following projects:

actor-rust-scraper - Experimental scraper in Rust suited for running locally or on the Apify platform. Inspired by Apify SDK.

parser - 📜 Extract meaningful content from the chaos of a web page

tq - Perform a lookup by CSS selector on an HTML input

koreader - An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices

yq - Command-line YAML, XML, TOML processor - jq wrapper for YAML/XML/TOML documents

hn-search - Hacker News Search

tools - all-in collection of productivity scripts, CLI tools, utility libraries, fuse filesystems, and also some stuff

readability.php - PHP port of Mozilla's Readability.js

hq - lightweight command line HTML processor using CSS and XPath selectors

rssguard - Feed reader (and podcast player) which supports RSS/ATOM/JSON and many web-based feed services.

cargo-expand - Subcommand to show result of macro expansion

SponsorBlock - Skip YouTube video sponsors (browser extension)
