My Hacker News client HACK for iOS and Android has a built-in browser with a reader mode. On iOS I was able to use the reader mode provided by SFSafariViewController, but that wasn't available on Android.
So I had to read a ton about this. I ended up using a heavily modified Kotlin version of Readability:
https://github.com/dankito/Readability4J
https://play.google.com/store/apps/details?id=com.pranapps.h...
https://apps.apple.com/us/app/id1464477788
It's really nice that browsers offer reader modes, but they are frustratingly incomplete.
Really, let's take the user's perspective for once and consider: what if I always want reader mode? That's technologically impossible to achieve completely, and all the existing solutions are band-aids.
Firefox and others' attempts rely on the page authors' goodwill. But some pages will always attempt to frustrate reader modes.
Alternative approaches to content extraction use machine learning, such as [1], but those of course need to be updated for culture-, language-, and technology-specific changes.
It's a mess and will remain so for the foreseeable future.
[1] https://github.com/dragnet-org/dragnet
I implemented a variation of the Readability algorithm some 9 years ago, in case anyone needs a server-side Python version and is interested in dragging it (kicking and screaming) into the 2020s:
https://github.com/rcarmo/soup-strainer
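For anyone curious what the core of a Readability-style algorithm looks like, here is a rough stdlib-only sketch of the central idea (this is not soup-strainer's actual code, just an illustration): score block-level candidates by how much text they contain, penalized by link density, and keep the winner.

```python
# Toy Readability-style scorer: long text wins, link-heavy blocks lose.
# Assumes well-paired tags; a real implementation handles far more cases.
from html.parser import HTMLParser


class BlockScorer(HTMLParser):
    BLOCK_TAGS = {"p", "div", "article", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []    # per open block: [text_len, link_len, parts]
        self.in_link = 0
        self.scores = []   # (score, text snippet)

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.stack.append([0, 0, []])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in self.BLOCK_TAGS and self.stack:
            text_len, link_len, parts = self.stack.pop()
            if text_len:
                link_density = link_len / text_len
                # More text is better; mostly-links blocks score near zero.
                self.scores.append((text_len * (1 - link_density),
                                    " ".join(parts).strip()))

    def handle_data(self, data):
        if self.stack:
            n = len(data.strip())
            self.stack[-1][0] += n
            if self.in_link:
                self.stack[-1][1] += n
            self.stack[-1][2].append(data)


def best_block(html):
    parser = BlockScorer()
    parser.feed(html)
    return max(parser.scores)[1] if parser.scores else ""
```

A navigation bar full of links scores ~0 even if it has lots of text, which is the key trick that makes this family of heuristics work at all.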
I haven’t directly compared them, but I have also found mercury parser (https://github.com/postlight/mercury-parser) to be very reliable.
Since it turns a website into very plain (X)HTML, it's fairly easy to use it to build a browsing proxy or automatically produce EPUB files for e-readers, which is what I do.
Another approach to completeness could be to remove noise from the original page instead of parsing just the text from it. In the worst case the page isn't changed at all but it's still usable (like when ad blockers miss some ads). [0]
You probably only want always-on reader mode for articles -- and detecting what's an article is another NLP problem. Yet both the completeness and article detection can likely be solved through heuristics in 90% of cases. Maybe it depends on how much the last 10% frustrate you.
[0] Disclaimer: I'm working on a browser extension that does this. https://github.com/lindylearn/unclutter
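On the "detecting what's an article" point: the heuristic version of that check can be surprisingly simple. Here is a toy stdlib sketch (loosely modeled on the idea behind Readability's `isProbablyReaderable`, not its actual code): sum a score over paragraph-like nodes that exceed a minimum text length, and declare the page "readerable" past a threshold. The thresholds here are made-up illustration values.

```python
import math
from html.parser import HTMLParser


class ReaderableCheck(HTMLParser):
    """Toy 'can this page go to reader mode?' heuristic."""

    def __init__(self, min_length=140, min_score=20):
        super().__init__()
        self.min_length = min_length
        self.min_score = min_score
        self.depth = 0      # inside a <p> or <pre>
        self.buf = []
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "pre"):
            self.depth += 1
            self.buf = []

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "pre") and self.depth:
            self.depth -= 1
            length = len(" ".join(self.buf).strip())
            if length >= self.min_length:
                # Diminishing returns: sqrt keeps one huge node from
                # dominating the decision.
                self.score += math.sqrt(length - self.min_length)


def is_probably_readerable(html):
    checker = ReaderableCheck()
    checker.feed(html)
    return checker.score > checker.min_score
```

This gets the easy 90% of cases; the frustrating 10% (image galleries, forums, product pages) is where the NLP problem actually lives.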
I've been working on several web extractor projects, so I thought I'd share some of my findings from working on them. Granted, it's been several months since I last touched them, so I might be forgetting some things.
There are several open source projects for extracting web content. However, there are three extractors that I've worked with that give good results:
- readability.js[1], the web extractor by Mozilla that is used in Firefox.
- dom-distiller[2], the web extractor by the Chromium team, written in Java.
- trafilatura[3], a Python package by Adrien Barbaresi from the BBAW[4].
First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want: in a web page via a `script` tag, or in a Node project.
Next, DomDistiller is the extractor used in Chromium. It's written in Java with a whopping 14,000+ lines of code and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.
Finally, Trafilatura is a Python package released under the GPLv3 license. Created to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it came to work really well with other languages too. It's a bit slow compared to Readability.js, though.
All three work in a similar way: extract the metadata, remove unneeded content, and finally return the cleaned-up content. Their differences (that I remember) are:
- Readability insists on having no special rules for any website, while DomDistiller and Trafilatura make small exceptions for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.
- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.
- DomDistiller's metadata extraction is more thorough than the others'. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.
- Since DomDistiller only runs inside Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to one study[6], this step doesn't really affect the extraction result.
- DomDistiller also has an experimental feature to find and extract the next page on sites that split their articles across several partial pages.
- Since Trafilatura was created for collecting web corpora, its main ability is extracting the text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date of a web page.
- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
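That repeated-element idea is easy to illustrate. A minimal sketch (not Trafilatura's actual implementation) over a list of already-extracted text segments: anything whose normalized text occurs more than a few times across the page, like a "Share this post" widget under every section, is treated as furniture rather than content.

```python
from collections import Counter


def drop_repeated(segments, max_repeats=2):
    """Keep only segments whose text occurs at most max_repeats times.

    Sketch of the 'repeated element' heuristic: frequently recurring
    text blocks are page furniture, not article content.
    """
    counts = Counter(s.strip() for s in segments)
    return [s for s in segments if counts[s.strip()] <= max_repeats]
```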
I've found a benchmark[8] comparing the extractors, and it says Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it's intended for gathering web corpora: it's really great at extracting text content, but IIRC it's not as good as Readability.js or DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature though).
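On the CSS-visibility point above: outside a real browser there are no computed styles, so standalone extractors can only approximate DomDistiller's check by looking at inline `style` attributes. A toy stdlib sketch of that approximation (assuming well-paired tags, since void elements like `<br>` would unbalance the depth counter):

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text, skipping subtrees hidden via inline styles.

    Only inline styles are visible here; rules from external
    stylesheets would need a rendering engine, which is exactly
    DomDistiller's advantage inside Chromium.
    """
    HIDDEN_MARKERS = ("display:none", "visibility:hidden")

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while inside a hidden subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or any(m in style for m in self.HIDDEN_MARKERS):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.parts)
```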
By the way, if you are using Go and need to use a web extractor, I already ported all three of them to Go[9][10][11] including their dependencies[12][13], so have fun with it.
[1]: https://github.com/mozilla/readability
For those wondering if there's a readability lib in their favorite language: here's a list of them all (as far as I know), plus the original Arc90 implementation.
https://github.com/masukomi/arc90-readability/#readability
Please submit a PR if there's something I don't have listed there.