Top 6 webcrawling Open-Source Projects

heritrix3

6 2,709 6.6 Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Project mention: WARC'in the Crawler | news.ycombinator.com | 2023-12-21

If anyone's interested in web crawling technology, check out Heritrix [1]. It has been around since 2004, and while not the most performant, it has incorporated many responsible disciplines in its design and, as this article pointed out, the WARC format.
1. https://heritrix.readthedocs.io

scrapyrt

3 818 6.8 Python

HTTP API for Scrapy spiders
ralger

1 152 1.5 R

ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2.
newspaperjs

1 71 0.0 HTML

News extraction and scraping. Article Parsing
courlan

0 70 7.5 Python

Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters
proxy_web_crawler

3 41 7.5 Python

Automates the process of repeatedly searching for a website via scraped proxy IP and search keywords

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

webcrawling related posts

WARC'in the Crawler

1 project | news.ycombinator.com | 21 Dec 2023
Why isn't the internet more fun and weird?

2 projects | news.ycombinator.com | 6 Jul 2022
Is there a way to archive groups of webpages similarly to how web archive does it?

2 projects | /r/DataHoarder | 29 Mar 2022
Heritrix: Internet Archive's extensible, web-scale, archival-quality web crawler

1 project | news.ycombinator.com | 26 Sep 2021
Best Http client for web scraping

1 project | /r/java | 26 Sep 2021
Selenium using rotating proxy?

1 project | /r/learnpython | 6 Jul 2021
The Internet Is Rotting

3 projects | news.ycombinator.com | 30 Jun 2021

Index

What are some of the best open-source webcrawling projects? This list will help you:

	Project	Stars
1	heritrix3	2,709
2	scrapyrt	818
3	ralger	152
4	newspaperjs	71
5	courlan	70
6	proxy_web_crawler	41
