Powerful and free scraper with a headless browser under the hood and Readability for parsing

scrapper

Mentions: 1 | Stars: 103 | Activity: 8.4 | Language: Python

Web scraper with a simple REST API that runs in Docker and uses a headless browser and Readability.js for parsing.

trafilatura

Mentions: 13 | Stars: 2,778 | Activity: 8.7 | Language: Python

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

I've been playing with Trafilatura lately, and it's very good. There are a few very thorough comparisons to other projects and it really shines. It doesn't do anything headless from what I can tell, but it doesn't have to do the scraping itself. Maybe an option could be to use Playwright to scrape, then Trafilatura to parse. Food for thought.
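
The Playwright-then-Trafilatura idea from the comment above could look roughly like the sketch below. It is only an illustration of that suggestion, not how scrapper itself works: the function names (`fetch_rendered_html`, `extract_main_text`, `scrape_article`), the 30-second timeout, and the `networkidle` wait strategy are all assumptions chosen for the example. The two libraries are imported lazily so the fetch and extract steps can be swapped out independently.

```python
from typing import Callable, Optional


def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the rendered HTML.

    Hypothetical helper: the timeout and wait strategy are assumptions.
    """
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so JS-rendered content is present.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_main_text(html: str) -> Optional[str]:
    """Extract the main article text with Trafilatura (None if nothing is found)."""
    # Imported lazily for the same reason as above.
    import trafilatura

    return trafilatura.extract(html)


def scrape_article(
    url: str,
    fetch: Callable[[str], str] = fetch_rendered_html,
    extract: Callable[[str], Optional[str]] = extract_main_text,
) -> Optional[str]:
    """Scrape with one tool, parse with another: fetch the HTML, then extract text."""
    html = fetch(url)
    return extract(html)
```

With both packages installed (`pip install playwright trafilatura`, then `playwright install chromium`), calling `scrape_article("https://example.com")` would run the full pipeline. Passing a different `fetch` callable, for example a plain HTTP client, keeps the Trafilatura step unchanged for pages that do not need JavaScript rendering.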

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

How does Firefox's Reader View work?
15 projects | news.ycombinator.com | 30 Mar 2022
Trafilatura: Python tool to gather text on the Web
3 projects | news.ycombinator.com | 14 Aug 2023
Show HN: Build AI Dags with Memory; Run and Validate LLM Tools in Containers
2 projects | news.ycombinator.com | 21 Apr 2023
I made a Chrome Extension that lets you ask any question about the page you are on (bluf.ai)
2 projects | /r/SideProject | 6 Mar 2023
Gathering News Headlines
1 project | /r/Automate | 8 Aug 2022

This page summarizes the projects mentioned and recommended in the original post on /r/webscraping
Crawler web-scraping Readability Web Content Extracting Docker
Post date: 18 Mar 2023
