How to scrape entire blogs with content?

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/webscraping.

  • readability

    A standalone version of the readability lib

    If you're familiar with JavaScript/TypeScript, take a look at https://github.com/mozilla/readability, which extracts article content from web pages. To automate the process, you can use https://github.com/puppeteer/puppeteer to control Chrome and inject the Mozilla code into each page you want to scrape.

  • puppeteer

    Headless Chrome Node.js API

  • scraper

    Open-source Node.js web scraper. It scrapes, stores, and exports data. Use it from your own JavaScript/TypeScript code, from the command line, or in a Docker container. Supports multiple storage options: SQLite, MySQL, PostgreSQL. Supports multiple browser or DOM-like clients: Puppeteer, Playwright, Cheerio, jsdom. (by get-set-fetch)

    You can use https://github.com/get-set-fetch/scraper with a custom plugin based on mozilla/readability, as detailed in https://getsetfetch.org/node/custom-plugins.html (extracting news article content). I think it's a close match to your use case.
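
The Puppeteer plus Readability suggestion above could look roughly like the sketch below. It is not taken from the original thread. It assumes both packages are installed with `npm install puppeteer @mozilla/readability` and that the standalone Readability source sits at the `READABILITY_PATH` shown, which may differ in your setup. The names `sameOriginLinks`, `extractArticle`, and `scrapeBlog` are hypothetical, chosen for illustration.

```javascript
// Assumption: the standalone Readability source lives at this path after
// `npm install @mozilla/readability`. Adjust it if your layout differs.
const READABILITY_PATH = "node_modules/@mozilla/readability/Readability.js";

// Hypothetical helper: resolve hrefs against the blog URL, keep same-origin
// links only, drop #fragments, and deduplicate.
function sameOriginLinks(baseUrl, hrefs) {
  const origin = new URL(baseUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (url.origin !== origin) continue;
    url.hash = "";
    seen.add(url.href);
  }
  return [...seen];
}

// Load one page, inject Readability, and return the extracted article or null.
async function extractArticle(page, url) {
  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.addScriptTag({ path: READABILITY_PATH });
  return page.evaluate(() => {
    // Readability modifies the DOM it parses, so parse a clone instead.
    const article = new Readability(document.cloneNode(true)).parse();
    return article ? { title: article.title, text: article.textContent } : null;
  });
}

// Visit the blog index, collect same-origin links, then extract each page.
async function scrapeBlog(indexUrl) {
  // Loaded lazily so the pure helper above works without Puppeteer installed.
  const { default: puppeteer } = await import("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(indexUrl, { waitUntil: "domcontentloaded" });
    const hrefs = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => a.getAttribute("href"))
    );
    const results = [];
    for (const url of sameOriginLinks(indexUrl, hrefs)) {
      const article = await extractArticle(page, url);
      if (article) results.push({ url, ...article });
    }
    return results;
  } finally {
    await browser.close();
  }
}
```

Calling `scrapeBlog("https://blog.example.com/")` (a placeholder URL) would resolve to an array of `{ url, title, text }` objects. This sketch follows only the links found on the index page, so a paginated blog would need an extra loop over its archive pages. Pages with a strict Content-Security-Policy may reject the injected script, in which case calling `page.setBypassCSP(true)` before navigation is one option to try.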
