How to scrape entire blogs with content?

readability

Mentions: 51 | Stars: 8,056 | Activity: 6.3 | Language: JavaScript

A standalone version of the readability lib

If you're familiar with JavaScript/TypeScript, take a look at https://github.com/mozilla/readability, which extracts article content from web pages. To automate the process, you can use https://github.com/puppeteer/puppeteer to control Chrome and inject the Mozilla code into each page you want to scrape.
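As a rough illustration of that suggestion, here is a minimal sketch, assuming `puppeteer` and `@mozilla/readability` are installed from npm and that the package ships `Readability.js` at its root. The names `sameOriginLinks` and `scrapeBlog` are hypothetical helpers, not part of either library. The sketch only follows links found on the page you start from, so pagination, rate limiting, and filtering out non-article pages are left out.

```javascript
// Sketch only: assumes `npm install puppeteer @mozilla/readability` has been run.
const fs = require("fs");

// Hypothetical helper: resolve hrefs against the blog URL, keep only links on
// the same origin, drop #fragments, and remove duplicates.
function sameOriginLinks(hrefs, blogUrl) {
  const origin = new URL(blogUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, blogUrl);
    } catch {
      continue; // skip malformed hrefs
    }
    if (url.origin !== origin) continue;
    url.hash = "";
    seen.add(url.href);
  }
  return [...seen];
}

// Hypothetical entry point: visit every same-origin link found on blogUrl and
// run Readability inside each page.
async function scrapeBlog(blogUrl) {
  // Loaded lazily so the helper above works without the dependencies installed.
  const puppeteer = require("puppeteer");
  // Assumption: the browser build of Readability lives at this path in the package.
  const readabilitySource = fs.readFileSync(
    require.resolve("@mozilla/readability/Readability.js"),
    "utf8"
  );

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(blogUrl, { waitUntil: "domcontentloaded" });
    const hrefs = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => a.getAttribute("href"))
    );

    const articles = [];
    for (const url of sameOriginLinks(hrefs, blogUrl)) {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      // Inject the Mozilla code so `Readability` exists as a global in the page.
      await page.addScriptTag({ content: readabilitySource });
      const article = await page.evaluate(() => {
        // Readability mutates the DOM it is given, so parse a clone.
        const parsed = new Readability(document.cloneNode(true)).parse();
        return parsed && { title: parsed.title, text: parsed.textContent };
      });
      if (article) articles.push({ url, ...article });
    }
    return articles;
  } finally {
    await browser.close();
  }
}
```

Calling `scrapeBlog("https://blog.example.com/")` (a placeholder URL) would resolve to an array of `{ url, title, text }` objects. Pages with a strict Content Security Policy may reject the injected script; in that case, calling `page.setBypassCSP(true)` before navigating is one possible workaround.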

puppeteer

Mentions: 359 | Stars: 86,704 | Activity: 9.9 | Language: TypeScript

Node.js API for Chrome

scraper

Mentions: 12 | Stars: 98 | Activity: 0.0 | Language: TypeScript

Node.js web scraper. Contains a command-line interface, Docker container, Terraform module, and Ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom. (by get-set-fetch)

You can use https://github.com/get-set-fetch/scraper with a custom plugin based on mozilla/readability, as detailed in https://getsetfetch.org/node/custom-plugins.html (extracting news article content). I think it's a close match for your use case.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
