Top 23 Web Crawling Open-Source Projects
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
-
FastImage
FastImage finds the size or type of an image given its URI, fetching as little data as needed.
-
Wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
-
MetaInspector
Ruby gem for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, links, images...
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It's particularly useful for interacting with web forms, like login pages. Here's a basic example to illustrate how you can use MechanicalSoup for web scraping:
Project mention: RSS can be used to distribute all sorts of information | news.ycombinator.com | 2023-11-20
There is JSON Feed¹ already. One of the spec writers is behind micro.blog, which is the first place I saw it (and also one of the few places I've seen it). I don't think it is a bad idea, and it doesn't take all that long to implement it.
I have long hoped it would pick up with the JSON-ify-everything crowd, just so I'd never see a non-Atom feed again. We perhaps wouldn't need so much of the magic that is wrapped up in packages like feedparser² to deal with all the brokenness of RSS in the wild.
¹ https://www.jsonfeed.org/
² https://github.com/kurtmckee/feedparser
Web Crawling related posts
- Launching Crawlee Blog: Your Node.js resource hub for web scraping and automation.
- How to scrape a website with Python (Beginner tutorial)
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Seven Python Projects to Elevate Your Coding Skills
- What is SERP? Meaning, Use Cases and Approaches
- Anything like scrapy in other languages?
- RSS can be used to distribute all sorts of information
-
A note from our sponsor - SaaSHub
www.saashub.com | 28 Apr 2024
Index
What are some of the best open-source Web Crawling projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Scrapy | 50,896 |
| 2 | pyspider | 16,319 |
| 3 | requests-html | 13,575 |
| 4 | crawlee | 12,129 |
| 5 | webmagic | 11,239 |
| 6 | jsoup | 10,625 |
| 7 | portia | 9,166 |
| 8 | MechanicalSoup | 4,552 |
| 9 | Crawler4j | 4,469 |
| 10 | Mechanize | 4,354 |
| 11 | RoboBrowser | 3,689 |
| 12 | Apache Nutch | 2,809 |
| 13 | Grab | 2,354 |
| 14 | gain | 2,031 |
| 15 | Scrapely | 1,846 |
| 16 | feedparser | 1,829 |
| 17 | PSpider | 1,811 |
| 18 | Upton | 1,614 |
| 19 | anemone | 1,613 |
| 20 | cola | 1,488 |
| 21 | FastImage | 1,355 |
| 22 | Wombat | 1,303 |
| 23 | MetaInspector | 1,021 |