How does Firefox's Reader View work?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Readability4J

    A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.

  • My Hacker News client HACK for iOS and Android has a built-in browser with a reader mode. On iOS, I was able to use the reader mode provided by SFSafariViewController, but that wasn't available on Android.

    So I had to read a ton about this. I ended up using a heavily modified Kotlin version of Readability:

    https://github.com/dankito/Readability4J

    https://play.google.com/store/apps/details?id=com.pranapps.h...

    https://apps.apple.com/us/app/id1464477788

  • dragnet

    Just the facts -- web page content extraction

  • It's really nice that browsers offer reader modes, but they are frustratingly incomplete.

    Really, let's switch to the user's perspective for once and consider: what if I always want reader mode? This is technologically completely impossible, and all the solutions are band-aids.

    Firefox and others' attempts rely on the page authors' goodwill. But some pages will always attempt to frustrate reader modes.

    Alternative approaches for content extraction use machine learning, such as [1], but they of course need to be updated for culture-, language-, and technology-specific changes.

    It's a mess and will remain so for the foreseeable future.

    [1] https://github.com/dragnet-org/dragnet

  • soup-strainer

    A reimplementation of the Readability/Decruft algorithm using BeautifulSoup and html5lib

  • I implemented a variation of the Readability algorithm some 9 years ago, in case anyone needs a server-side Python version and is interested in dragging it (kicking and screaming) into the 2020s:

    https://github.com/rcarmo/soup-strainer

  • parser

    📜 Extract meaningful content from the chaos of a web page

  • I haven't directly compared them, but I have also found mercury parser (https://github.com/postlight/mercury-parser) to be very reliable.

    Since it turns a website into very plain (X)HTML, it's fairly easy to use it to make a browsing proxy or automatically produce epub files for e-readers, which is what I do.

  • unclutter

    A modern reader mode and article library for your browser.

  • Another approach to completeness could be to remove noise from the original page instead of parsing just the text from it. In the worst case the page isn't changed at all but it's still usable (like when ad blockers miss some ads). [0]

    You probably only want always-on reader mode for articles -- and detecting what's an article is another NLP problem. Yet both the completeness and article detection can likely be solved through heuristics in 90% of cases. Maybe it depends on how much the last 10% frustrate you.

    [0] Disclaimer: I'm working on a browser extension that does this. https://github.com/lindylearn/unclutter
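
    The "is this an article?" heuristic mentioned above can be sketched as follows. This is only an illustration using Python's standard library, not the algorithm of unclutter or of any browser; the function name `looks_like_article` and both thresholds are assumptions picked for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # a new <p> implicitly closes an open one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def close(self):
        super().close()
        self._flush()  # keep a trailing paragraph that was never closed


def looks_like_article(html, min_paragraph_chars=140, min_total_chars=500):
    """Guess whether a page holds an article (thresholds are assumptions)."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    # Only long paragraphs count; short ones are often teasers or captions.
    long_paragraphs = [p for p in collector.paragraphs
                       if len(p) >= min_paragraph_chars]
    return sum(len(p) for p in long_paragraphs) >= min_total_chars
```

    A page with a few long paragraphs passes, while a navigation page made of links and short teasers does not. Real pages will need tuning, and the hard 10% will still slip through.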

  • readability

    A standalone version of the readability lib

  • I've worked on several web extractor projects, so I think I can share some of my findings. Granted, it's been several months since I worked on them, so I might be forgetting some things.

    There are several open source projects for extracting web content. However, there are three extractors that I've worked with and that give good results:

    - readability.js[1], a web extractor by Mozilla that is used in Firefox.

    - dom-distiller[2], a web extractor by the Chromium team, written in Java.

    - trafilatura[3], a Python package by Adrien Barbaresi from BBAW[4].

    First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's JS, you can use it wherever you want, either in a web page via a `script` tag or in a Node project.

    Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.

    Finally, Trafilatura is a Python package released under the GPLv3 license. Created to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it came to work really well with other languages too. It's a bit slow compared to Readability.js, though.

    All three work in a similar way: they extract metadata, remove unneeded content, and finally return the cleaned-up content. Their differences (that I remember) are:

    - Readability insists on making no special rules for any website, while DomDistiller and Trafilatura make small exceptions for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.

    - Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.

    - In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.

    - Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to some research[6], this step doesn't really affect the extraction result.

    - DomDistiller also has an experimental feature to find and extract the next page on sites that split an article across several partial pages.

    - Trafilatura was created for collecting web corpora, so its main ability is extracting the text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7], whose only purpose is to extract the publication or modification date of a web page.

    - Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
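
    The repeated-element idea in the last item can be illustrated with a small standard-library sketch. It is not Trafilatura's implementation; the function name `drop_repeated_blocks`, the `max_share` threshold, and the assumption that each page has already been split into text blocks are all hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lower-case and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def drop_repeated_blocks(pages, max_share=0.5):
    """Remove text blocks that show up on too many pages of the same site.

    `pages` is a list of pages, each page being a list of text blocks.
    A block is dropped when it appears on more than `max_share` of all
    pages (the threshold is an assumption chosen for this example).
    """
    seen_on = Counter()
    for blocks in pages:
        # Count each block once per page, however often it repeats there.
        for block in {normalize(b) for b in blocks}:
            seen_on[block] += 1

    limit = max_share * len(pages)
    return [
        [b for b in blocks if seen_on[normalize(b)] <= limit]
        for blocks in pages
    ]
```

    Fed three pages that each contain the same newsletter prompt plus one unique story, this keeps only the stories.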

    I've found a benchmark[8] that compares the performance of the extractors, and it says that Trafilatura has the best accuracy of them all. However, before you rush to use Trafilatura, remember that it is intended for gathering web corpora, so it's really great for extracting text content, but IIRC it is not as good as Readability.js and DomDistiller for extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature, though).
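
    For readers wondering what "accuracy" can mean here: one common way to score an extractor is to compare its output with a hand-made reference text using overlapping word n-grams (shingles). The sketch below assumes that approach; the benchmark cited above may compute its numbers differently, and the function names and the shingle size of 4 are choices made for this example.

```python
from collections import Counter


def shingles(text, n=4):
    """Return a multiset of overlapping n-word shingles from `text`."""
    words = text.split()
    if not words:
        return Counter()
    if len(words) < n:
        # Very short texts become a single shingle.
        return Counter([tuple(words)])
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def score(extracted, reference, n=4):
    """Return (precision, recall, f1) of `extracted` against `reference`."""
    got, want = shingles(extracted, n), shingles(reference, n)
    overlap = sum((got & want).values())  # multiset intersection
    precision = overlap / sum(got.values()) if got else 0.0
    recall = overlap / sum(want.values()) if want else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

    Leftover clutter lowers precision, while missing article text lowers recall, so an extractor that keeps a bit of navigation but captures the whole article scores high recall and lower precision.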

    By the way, if you are using Go and need a web extractor, I've already ported all three of them to Go[9][10][11], including their dependencies[12][13], so have fun with them.

    [1]: https://github.com/mozilla/readability

  • dom-distiller

    Distills the DOM (discontinued)

  • trafilatura

    Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

  • htmldate

    Fast and robust date extraction from web pages, with Python or on the command-line
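
    To give a feel for what date extraction involves, here is a minimal standard-library sketch. It is not htmldate's API or algorithm: the function name `guess_publication_date`, the list of metadata keys, and the fallback to a `/YYYY/MM/DD/` pattern in the URL are assumptions made for illustration.

```python
import re
from datetime import date
from html.parser import HTMLParser

# Metadata keys that often carry a publication date (an assumed, short list).
META_KEYS = {"article:published_time", "datepublished", "date"}

ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


class DateHints(HTMLParser):
    """Collects date-like strings from <meta> and <time> elements."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name")
                   or attrs.get("itemprop") or "").lower()
            if key in META_KEYS and attrs.get("content"):
                self.candidates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.candidates.append(attrs["datetime"])


def guess_publication_date(html, url=""):
    """Return the first plausible date from markup, then from the URL."""
    hints = DateHints()
    hints.feed(html)
    hints.close()

    matches = [ISO_DATE.search(c) for c in hints.candidates]
    matches.append(URL_DATE.search(url))
    for match in matches:
        if match:
            try:
                return date(*map(int, match.groups()))
            except ValueError:
                continue  # e.g. month 13: not a real date, try the next hint
    return None
```

    Markup hints are tried before the URL because they are usually more specific. Pages without either hint yield `None`, which is where a dedicated library earns its keep.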

  • article-extraction-benchmark

    Article extraction benchmark: dataset and evaluation scripts

  • go-domdistiller

    Go-DomDistiller is a Go port of the DOM Distiller library which implements Reader mode in Chrome for Android and Desktop. It has no dependencies on Chromium and is meant to run as a command line program or on a server.

  • go-trafilatura

    go-trafilatura is a Go port of the trafilatura Python library.

  • go-htmldate

    CLI and Go package for extracting the publication date of a web page.

  • go-dateparser

    go parser for human readable dates ported from the dateparser python package

  • arc90-readability

    A copy of the original Arc90 repo with links to many of the current ports.

  • For those wondering if there's a readability lib in their favorite language, here's a list of them all (as far as I know), plus the original Arc90 implementation:

    https://github.com/masukomi/arc90-readability/#readability

    Please submit a PR if there's something I don't have listed there.
