My Hacker News client HACK for iOS and Android has a built-in browser with a reader mode. On iOS I was able to use the reader mode provided by SFSafariViewController, but that wasn't available on Android.
So I had to read a ton about this. I ended up using a heavily modified Kotlin version of Readability:
https://github.com/dankito/Readability4J
https://play.google.com/store/apps/details?id=com.pranapps.h...
https://apps.apple.com/us/app/id1464477788
It's really nice that browsers offer reader modes, but they are frustratingly incomplete.
Really, let's take the user's perspective for once and consider: what if I always want reader mode? That's technologically impossible to achieve completely, and all the existing solutions are band-aids.
Firefox and others' attempts rely on the page authors' goodwill. But some pages will always attempt to frustrate reader modes.
Alternative approaches to content extraction use machine learning, such as [1], but those of course need to be updated for culture-, language-, and technology-specific changes.
It's a mess and will remain so for the foreseeable future.
[1] https://github.com/dragnet-org/dragnet
I implemented a variation of the Readability algorithm some 9 years ago, in case anyone needs a server-side Python version and is interested in dragging it (kicking and screaming) into the 2020s:
https://github.com/rcarmo/soup-strainer
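For anyone curious what the core of a Readability-style algorithm looks like, here is a rough stdlib-only sketch of the central idea (this is not soup-strainer's actual code, just an illustration): score block-level candidates by how much text they contain, penalized by link density, and keep the winner.

```python
# Toy Readability-style scorer: long text wins, link-heavy blocks lose.
# Assumes well-paired tags; a real implementation handles far more cases.
from html.parser import HTMLParser


class BlockScorer(HTMLParser):
    BLOCK_TAGS = {"p", "div", "article", "td"}

    def __init__(self):
        super().__init__()
        self.stack = []    # per open block: [text_len, link_len, parts]
        self.in_link = 0
        self.scores = []   # (score, text snippet)

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.stack.append([0, 0, []])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in self.BLOCK_TAGS and self.stack:
            text_len, link_len, parts = self.stack.pop()
            if text_len:
                link_density = link_len / text_len
                # More text is better; mostly-links blocks score near zero.
                self.scores.append((text_len * (1 - link_density),
                                    " ".join(parts).strip()))

    def handle_data(self, data):
        if self.stack:
            n = len(data.strip())
            self.stack[-1][0] += n
            if self.in_link:
                self.stack[-1][1] += n
            self.stack[-1][2].append(data)


def best_block(html):
    parser = BlockScorer()
    parser.feed(html)
    return max(parser.scores)[1] if parser.scores else ""
```

A navigation bar full of links scores ~0 even if it has lots of text, which is the key trick that makes this family of heuristics work at all.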
I haven’t directly compared them, but I have also found mercury parser (https://github.com/postlight/mercury-parser) to be very reliable.
Since it turns a website into very plain (X)HTML, it's fairly easy to use it to build a browsing proxy or automatically produce EPUB files for e-readers, which is what I do.
Another approach to completeness could be to remove noise from the original page instead of parsing just the text from it. In the worst case the page isn't changed at all but it's still usable (like when ad blockers miss some ads). [0]
You probably only want always-on reader mode for articles -- and detecting what's an article is another NLP problem. Yet both the completeness and article detection can likely be solved through heuristics in 90% of cases. Maybe it depends on how much the last 10% frustrate you.
[0] Disclaimer: I'm working on a browser extension that does this. https://github.com/lindylearn/unclutter
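On the "detecting what's an article" point: the heuristic version of that check can be surprisingly simple. Here is a toy stdlib sketch (loosely modeled on the idea behind Readability's `isProbablyReaderable`, not its actual code): sum a score over paragraph-like nodes that exceed a minimum text length, and declare the page "readerable" past a threshold. The thresholds here are made-up illustration values.

```python
import math
from html.parser import HTMLParser


class ReaderableCheck(HTMLParser):
    """Toy 'can this page go to reader mode?' heuristic."""

    def __init__(self, min_length=140, min_score=20):
        super().__init__()
        self.min_length = min_length
        self.min_score = min_score
        self.depth = 0      # inside a <p> or <pre>
        self.buf = []
        self.score = 0.0

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "pre"):
            self.depth += 1
            self.buf = []

    def handle_data(self, data):
        if self.depth:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "pre") and self.depth:
            self.depth -= 1
            length = len(" ".join(self.buf).strip())
            if length >= self.min_length:
                # Diminishing returns: sqrt keeps one huge node from
                # dominating the decision.
                self.score += math.sqrt(length - self.min_length)


def is_probably_readerable(html):
    checker = ReaderableCheck()
    checker.feed(html)
    return checker.score > checker.min_score
```

This gets the easy 90% of cases; the frustrating 10% (image galleries, forums, product pages) is where the NLP problem actually lives.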
I've been working on several web extractor projects, so I thought I'd share some of my findings from working on them. Granted, it's been several months since I last touched them, so I might be forgetting some things.
There are several open source projects for extracting web content. However, there are three extractors that I've worked with that give good results:
- readability.js[1], the web extractor by Mozilla that is used in Firefox.
- dom-distiller[2], the web extractor by the Chromium team, written in Java.
- trafilatura[3], a Python package by Adrien Barbaresi from the BBAW[4].
First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want: in a web page via a `script` tag, or in a Node project.
Next, DomDistiller is the extractor used in Chromium. It's written in Java with a whopping 14,000+ lines of code and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.
Finally, Trafilatura is a Python package released under the GPLv3 license. Created to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it came to work really well with other languages too. It's a bit slow compared to Readability.js, though.
All three work in a similar way: extract the metadata, remove unneeded content, and finally return the cleaned-up content. Their differences (that I remember) are:
- Readability insists on having no special rules for any website, while DomDistiller and Trafilatura make small exceptions for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.
- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.
- DomDistiller's metadata extraction is more thorough than the others'. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.
- Since DomDistiller only runs inside Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to one study[6], this step doesn't really affect the extraction result.
- DomDistiller also has an experimental feature to find and extract the next page on sites that split their articles across several partial pages.
- Since Trafilatura was created for collecting web corpora, its main ability is extracting the text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date of a web page.
- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
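That repeated-element idea is easy to illustrate. A minimal sketch (not Trafilatura's actual implementation) over a list of already-extracted text segments: anything whose normalized text occurs more than a few times across the page, like a "Share this post" widget under every section, is treated as furniture rather than content.

```python
from collections import Counter


def drop_repeated(segments, max_repeats=2):
    """Keep only segments whose text occurs at most max_repeats times.

    Sketch of the 'repeated element' heuristic: frequently recurring
    text blocks are page furniture, not article content.
    """
    counts = Counter(s.strip() for s in segments)
    return [s for s in segments if counts[s.strip()] <= max_repeats]
```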
I've found a benchmark[8] comparing the extractors, and it says Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it's intended for gathering web corpora: it's really great at extracting text content, but IIRC it's not as good as Readability.js or DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature though).
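On the CSS-visibility point above: outside a real browser there are no computed styles, so standalone extractors can only approximate DomDistiller's check by looking at inline `style` attributes. A toy stdlib sketch of that approximation (assuming well-paired tags, since void elements like `<br>` would unbalance the depth counter):

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text, skipping subtrees hidden via inline styles.

    Only inline styles are visible here; rules from external
    stylesheets would need a rendering engine, which is exactly
    DomDistiller's advantage inside Chromium.
    """
    HIDDEN_MARKERS = ("display:none", "visibility:hidden")

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # > 0 while inside a hidden subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or any(m in style for m in self.HIDDEN_MARKERS):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.parts)
```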
By the way, if you are using Go and need to use a web extractor, I already ported all three of them to Go[9][10][11] including their dependencies[12][13], so have fun with it.
[1]: https://github.com/mozilla/readability
For those wondering if there's a readability lib in their favorite language: here's a list of them all (as far as I know), plus the original Arc90 implementation.
https://github.com/masukomi/arc90-readability/#readability
Please submit a PR if there's something I don't have listed there.