Reading from the web offline and distraction-free

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • parser

    📜 Extract meaningful content from the chaos of a web page

    Good luck! Those HTML issues you're coming across are tough and so varied across the web!

    I was working with Mercury Parser (pluggable parsing for different sites) in the past.

    https://github.com/postlight/mercury-parser

  • ricecooker

    Python library for creating Kolibri channels and uploading to Studio

    Very cool.

    The take-any-webpage-offline need is also common in the education space (teachers want to save a webpage and send it to their students as part of a lesson and don't want to worry about availability or ads etc).

    I used to work on tools for this https://github.com/learningequality/ricecooker/blob/develop/... and https://github.com/learningequality/BasicCrawler/blob/master...

  • BasicCrawler

    Basic web crawler that automates website exploration and produces web resource trees.

  • zimit

    Make a ZIM file from any Web site and surf offline!

    ... which worked quite well for most sites, but is still very far from a general-purpose solution.

    There is also a more powerful, general-purpose scraper that generates a ZIM file here: https://github.com/openzim/zimit

    It would be really nice to have a "common" scraper code base that takes care of scraping (possibly with a real headless browser) and outputs all assets as files + info as JSON. This common code base could then be used by all kinds of programs to package the content as standalone HTML zip files, ePub, ZIM, or even PDF for crazy people like me who like to print things ;)

  • url-to-epub

    A simple script that generates an EPUB from a single URL, zero-config

    I do a lot of this work[3] (web to documents) and it's interesting to see other approaches. The Medium image problem is something I've faced as well, but never got around to fixing. I'm planning to get a reMarkable soon, so will definitely be trying this out.

    My personal solution has been https://github.com/captn3m0/url-to-epub/ (Node/readability), which I've tested against the entirety of Tor's original fiction collection[0] where it performs well enough (I'm biased). Another tool that does this beautifully well is percollate[1], but it doesn't give enough control of the metadata to the user - something I really care about.

    I've also started to use rdrview[2], which is a C port of the current Firefox implementation of "reader view". It is very unix-y, so it is easy to pipe content to it (I usually run it through tidy first). That makes it quite helpful for building web-archiving, web-to-pdf, or web-to-kindle pipelines.

    [0]: https://www.tor.com/category/all-fiction/original-fiction/

    [1]: https://github.com/danburzo/percollate

    [2]: https://github.com/eafer/rdrview

    [3]: https://captnemo.in/ebooks/

  • percollate

    A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

  • rdrview

    Firefox Reader View as a command line tool

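
The "common" scraper idea from the zimit comment (scrape once, output assets plus info as JSON, let other tools package the result) could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python's standard library instead of a real headless browser, it works on HTML that has already been fetched, and the names `AssetCollector`, `build_manifest`, and `ASSET_ATTRS` are hypothetical and not taken from zimit or any project listed here.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical choice of which tag/attribute pairs count as "assets".
# Note that <link href> also matches non-stylesheet links (e.g. canonical).
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}


class AssetCollector(HTMLParser):
    """Collects the page title and absolute asset URLs from an HTML string."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        attr = ASSET_ATTRS.get(tag)
        value = dict(attrs).get(attr) if attr else None
        if value:
            # Resolve relative references against the page URL.
            self.assets.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_manifest(url, html):
    """Return a JSON-serializable description of one scraped page."""
    collector = AssetCollector(url)
    collector.feed(html)
    return {
        "url": url,
        "title": collector.title.strip(),
        "assets": collector.assets,
    }


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title><link href='/s.css'></head>"
        "<body><img src='a.png'></body></html>"
    )
    print(json.dumps(build_manifest("https://example.org/post/", sample), indent=2))
```

A packaging tool (ePub, ZIM, PDF, or zipped HTML) would then only need to read the manifest and download the listed assets, without knowing how the page was scraped.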
