I got myself a Supernote A5X (awesome device, btw), and since it doesn't have a web browser or anything, I wanted a way to read news on it. I hacked together this utility in a couple of days and it works wonders for me personally, so I thought it might be interesting to others. It can also be used as a noise-free newspaper generator, since it removes images, ads, links, and other noisy stuff.
https://github.com/lnenad/newser
(there is a screenshot of the first page of the generated pdf)
It scrapes (news) websites for content and puts it into a PDF. For me the PDF location is my Dropbox Supernote directory, so my setup is to run this thing daily and have a fresh PDF with news whenever I want it.
It's probably rough around the edges (so far I've added crawl support for The Verge, Ars Technica, and Engadget), but I think it's a good base, so if anyone wants to contribute, feel free. Some of the stuff I want to add: pictures (maybe), and maybe parsing the article HTML to carry over font styling and other formatting.
I've tried to generalize it as much as possible, so the crawling is pretty much automatic and is controlled by a config file where you define "rules" for how to parse each website.
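To give a rough idea of what rule-driven extraction can look like, here is a minimal Python sketch. It is not newser's actual code or config format: the `RULES` layout, the `example.com` entry, and the `RuleExtractor` and `extract_article` names are all made up for illustration. It assumes reasonably well-formed HTML (every non-void tag is closed) and uses only the standard library.

```python
from html.parser import HTMLParser

# Hypothetical rule format; the real config file may look quite different.
RULES = {
    "example.com": {
        "content_tag": "div",             # element that wraps the article
        "content_class": "article-body",  # class that identifies it
        "skip_tags": {"script", "style", "figure", "aside"},  # noise to drop
    },
}

# Void elements never get a closing tag, so they must not affect depth counting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RuleExtractor(HTMLParser):
    """Collects text inside the container named by a rule, minus skipped tags."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0       # nesting depth inside the container (0 = outside)
        self.skip_depth = 0  # nesting depth inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.rule["content_tag"] and self.rule["content_class"] in classes:
                self.depth = 1
            return
        self.depth += 1
        if self.skip_depth or tag in self.rule["skip_tags"]:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and not self.skip_depth and text:
            self.chunks.append(text)


def extract_article(html, rule):
    """Return the article text, one line per text node that was kept."""
    parser = RuleExtractor(rule)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Adding a site then means adding one more entry to `RULES` rather than writing new scraping code. Note that this sketch puts every text node on its own line, so inline markup such as links splits a sentence across lines; a real tool would merge text per paragraph.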
This is great!
If it's useful, I work on a project where we maintain a repository of XPath selectors for extracting article content from many different sites: https://github.com/fivefilters/ftr-site-config - they're based on the original public Instapaper rules.
We also have PDF generation, but it's not really for crawling, and wasn't created for reading on a device like the Supernote, more for printing and reading: https://pdf.fivefilters.org/simple-print/
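For anyone curious how such selector files could be consumed, here is a small Python sketch that reads a rule file of `key: value` lines into a dictionary. The sample text is invented for illustration, and the set of keys shown (`title`, `body`, `strip`, `strip_id_or_class`) is an assumption about the format; check the repository for the authoritative list. Evaluating the XPath expressions themselves would need an XPath-capable HTML library, which is left out here.

```python
from collections import defaultdict

# Made-up rules for illustration only; not copied from any real site file.
SAMPLE_CONFIG = """\
# comment lines and blank lines are ignored

title: //h1[@class='headline']
body: //div[@id='story']
strip: //div[@class='related']
strip: //div[@class='newsletter']
strip_id_or_class: share-bar
"""


def parse_site_config(text):
    """Parse 'key: value' lines into {key: [values...]}, keeping repeats in order."""
    rules = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, since XPath values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        rules[key.strip()].append(value.strip())
    return dict(rules)
```

Keeping every value in a list means a key such as `strip` can appear several times, with each selector applied in turn.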
I had a similar setup for creating PDF files from RSS feeds (https://github.com/adityam/rss2kindle). I was simply downloading the webpage, using pandoc to convert HTML to ConTeXt, and typesetting it via ConTeXt (this gave me a lot of control over the formatting and took care of including external images as well). I had a separate script which emailed the PDF to my Kindle address.
The script worked reliably for multiple years until I stopped using the Kindle. I now have a Supernote A6X, and both pandoc and ConTeXt have improved significantly in the last decade, so I should give this another shot.
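The download, pandoc, ConTeXt pipeline described above could be sketched in Python roughly as follows. This is not the rss2kindle code: the function names and the `article.*` file names are made up for illustration, and it assumes `pandoc` and `context` are installed and on the PATH. Emailing the PDF is left out.

```python
import shutil
import subprocess
import urllib.request
from pathlib import Path


def pandoc_command(html_path, tex_path):
    """Command that converts an HTML file into a standalone ConTeXt document."""
    return ["pandoc", "--from", "html", "--to", "context", "--standalone",
            "--output", str(tex_path), str(html_path)]


def context_command(tex_name):
    """Command that typesets a ConTeXt source file into a PDF."""
    return ["context", str(tex_name)]


def build_pdf(url, workdir):
    """Download a page, convert it with pandoc, and typeset it with ConTeXt."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    html_path = workdir / "article.html"
    tex_path = workdir / "article.tex"

    # Fail early with a clear message if either tool is missing.
    for tool in ("pandoc", "context"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    with urllib.request.urlopen(url) as response:
        html_path.write_bytes(response.read())

    subprocess.run(pandoc_command(html_path, tex_path), check=True)
    # Run ConTeXt inside workdir so the PDF and auxiliary files land there.
    subprocess.run(context_command(tex_path.name), cwd=workdir, check=True)
    return workdir / "article.pdf"
```

Calling `build_pdf("https://example.com/some-article", "out")` would leave `out/article.pdf` behind if both tools succeed. Control over the formatting would come from a custom pandoc template or a ConTeXt environment file, neither of which is shown here.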