How to scrape entire blogs with content?

This page summarizes the projects mentioned and recommended in the original post on /r/webscraping.

  • readability

    A standalone version of the readability lib

  • If you're familiar with JavaScript or TypeScript, take a look at https://github.com/mozilla/readability, which extracts the main article content from web pages. To automate the process, you can use https://github.com/puppeteer/puppeteer to control Chrome and inject the Readability script into each page you want to scrape.

  • puppeteer

    Node.js API for Chrome

  • scraper

    Node.js web scraper. Includes a command-line tool, a Docker container, a Terraform module, and Ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, jsdom. (by get-set-fetch)

  • You can use https://github.com/get-set-fetch/scraper with a custom plugin based on mozilla/readability, as detailed in https://getsetfetch.org/node/custom-plugins.html (extracting news article content). I think it's a close match to your use case.
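
The Readability-plus-Puppeteer suggestion above could look roughly like the sketch below. It is only an illustration, not the poster's code: it assumes `puppeteer` and `@mozilla/readability` are installed from npm, and the names `normalizePostUrls`, `scrapeArticles`, `Article`, and `readabilityPath` are hypothetical. `readabilityPath` is assumed to point at the package's `Readability.js` file (for example somewhere under `node_modules/@mozilla/readability/`). How you collect the list of post links (sitemap, archive page, pagination) depends on the blog and is left out.

```typescript
// Sketch only. Assumes `puppeteer` and `@mozilla/readability` are installed from npm.
// All function, type, and parameter names here are hypothetical.

interface Article {
  url: string;
  title: string;
  textContent: string;
}

// Resolve raw hrefs against the blog URL, keep same-origin links,
// strip #fragments, and drop duplicates (insertion order is preserved).
function normalizePostUrls(hrefs: string[], blogUrl: string): string[] {
  const base = new URL(blogUrl);
  const seen = new Set<string>();
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, base);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.origin !== base.origin) continue; // external or mailto: links
    url.hash = "";
    seen.add(url.href);
  }
  return Array.from(seen);
}

// Visit each URL in headless Chrome, inject Readability.js, and run it in the page.
async function scrapeArticles(urls: string[], readabilityPath: string): Promise<Article[]> {
  // Loaded lazily (and untyped) so the URL helper above works without Puppeteer installed.
  const moduleName: string = "puppeteer";
  const mod: any = await import(moduleName);
  const puppeteer = mod.default ?? mod;

  const browser = await puppeteer.launch();
  const articles: Article[] = [];
  try {
    const page = await browser.newPage();
    await page.setBypassCSP(true); // some sites' CSP would block the injected script
    for (const url of urls) {
      await page.goto(url, { waitUntil: "networkidle2" });
      await page.addScriptTag({ path: readabilityPath });
      const parsed = await page.evaluate(() => {
        const g = globalThis as any;
        // parse() modifies the DOM it is given, so work on a clone of the document.
        const result = new g.Readability(g.document.cloneNode(true)).parse();
        return result
          ? { title: String(result.title ?? ""), textContent: String(result.textContent ?? "") }
          : null;
      });
      if (parsed) articles.push({ url, ...parsed });
    }
  } finally {
    await browser.close();
  }
  return articles;
}
```

For example, `normalizePostUrls(["/post-1", "/post-1#comments"], "https://blog.example/")` collapses both links into the single URL `https://blog.example/post-1`, which keeps the scraper from loading the same post twice. Pages are visited one at a time here for simplicity; a larger blog would likely need concurrency limits and delays between requests.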

