scraper (by get-set-fetch)
Node.js web scraper. Includes a command-line interface, Docker container, Terraform module, and Ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom.
If you're familiar with JavaScript/TypeScript, take a look at https://github.com/mozilla/readability, which extracts article content from web pages. To automate the process, you can use https://github.com/puppeteer/puppeteer to control Chrome and inject the Readability code into each page you want to scrape.
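A rough sketch of that inject-and-extract step, in case it helps. It assumes the page object exposes `goto`, `addScriptTag`, and `evaluate` the way a Puppeteer page does. `PageLike`, `Article`, and `extractArticle` are names made up for this example, and `readabilitySource` is assumed to be the text of the `Readability.js` file shipped with the `@mozilla/readability` package.

```typescript
// Hypothetical minimal shape of the page object. Puppeteer's Page has methods
// with these names; this trimmed-down interface is an assumption for the sketch.
interface PageLike {
  goto(url: string): Promise<unknown>;
  addScriptTag(options: { content: string }): Promise<unknown>;
  evaluate(expression: string): Promise<unknown>;
}

// Subset of the fields returned by Readability's parse() that this sketch keeps.
interface Article {
  title: string;
  textContent: string;
}

// Expression evaluated inside the page. Readability modifies the DOM it is
// given, so it receives a clone of the document rather than the live one.
const EXTRACT_EXPRESSION = "new Readability(document.cloneNode(true)).parse()";

async function extractArticle(
  page: PageLike,
  url: string,
  readabilitySource: string,
): Promise<Article | null> {
  await page.goto(url);
  // Inject the library source so `Readability` is defined in the page.
  await page.addScriptTag({ content: readabilitySource });
  const result = (await page.evaluate(EXTRACT_EXPRESSION)) as Partial<Article> | null;
  // parse() yields null when no article content could be identified.
  if (!result || typeof result.textContent !== "string") {
    return null;
  }
  return { title: result.title ?? "", textContent: result.textContent.trim() };
}
```

With Puppeteer installed, the page would typically come from `puppeteer.launch()` followed by `browser.newPage()`, and the browser should be closed with `browser.close()` once scraping is done. Keeping the page behind a small interface also lets the function be exercised with a stub instead of a real browser.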
You can use https://github.com/get-set-fetch/scraper with a custom plugin based on mozilla/readability, as detailed in https://getsetfetch.org/node/custom-plugins.html (extracting news article content). I think it's a close match to your use case.