-
crawlee
Crawlee: a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
Language: Node.js, Python | GitHub: 15.4K+ stars | link
-
Language: Python | GitHub: 52.9k stars | link
-
Language: Python | GitHub: 4.7K+ stars | link
-
Language: Node.js | GitHub: 6.7K+ stars | link
-
Language: Multi-language | GitHub: 30.6K stars | link
-
heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Language: Java | GitHub: 2.8K+ stars | link
-
Language: Java | GitHub: 2.9K+ stars | link
-
Language: Java | GitHub: 11.4K+ stars | link
-
Language: Ruby | GitHub: 6.1K+ stars | link
-
Language: Java | GitHub: 4.5K+ stars | link
-
web-harvest
Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages. Fork of http://web-harvest.sourceforge.net
Language: Go | GitHub: 11.1k stars | link
Related posts
-
Scrapy needs to have sane defaults that do no harm
-
12 tips on how to think like a web scraping expert
-
Current problems and mistakes of web scraping in Python and tricks to solve them!
-
Scrapy, a fast high-level web crawling and scraping framework for Python
-
Automate Spider Creation in Scrapy with Jinja2 and JSON