Top 23 Web Crawling Open-Source Projects
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
crawlee
Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
Project mention: Automating Data Collection with Apify: From Script to Deployment | dev.to | 2024-03-17
Previously, the Apify SDK offered a blend of crawling functionalities and Actor building features. However, a recent update separated these functionalities into two distinct libraries: Crawlee and Apify SDK v3. Crawlee now houses the web scraping and crawling tools, while Apify SDK v3 focuses solely on features specific to building Actors for the Apify platform. This distinction allows for a clear separation of concerns and enhances the development experience for various use cases.
MechanicalSoup is a Python library for web scraping that combines the simplicity of Requests with the convenience of BeautifulSoup. It is particularly useful for interacting with web forms, such as login pages.
Project mention: RSS can be used to distribute all sorts of information | news.ycombinator.com | 2023-11-20
There is JSON Feed¹ already. One of the spec writers is behind micro.blog, which is the first place I saw it (and also one of the few places I've seen it). I don't think it is a bad idea, and it doesn't take all that long to implement it.
I have long hoped it would pick up with the JSON-ify everything crowd, just so I'd never see a non-Atom feed again. We perhaps wouldn't need sooo much of the magic that is wrapped up in packages like feedparser² to deal with all the brokenness of RSS in the wild then.
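To illustrate the point about implementation effort, here is a rough sketch of reading a JSON Feed document with only the Python standard library. The sample feed and the `read_json_feed` helper are made up for this example. The field names (`version`, `items`, `id`, `url`, `content_text`, `content_html`) follow the JSON Feed format, but the sketch handles only a small subset of it.

```python
import json

# Hypothetical feed document used only for illustration.
SAMPLE_FEED = """
{
  "version": "https://jsonfeed.org/version/1.1",
  "title": "Example Feed",
  "items": [
    {"id": "2", "url": "https://example.org/second", "content_text": "Second post"},
    {"id": "1", "url": "https://example.org/first", "title": "First",
     "content_html": "<p>First post</p>"}
  ]
}
"""


def read_json_feed(text):
    """Parse a JSON Feed string into a list of simplified item dicts."""
    feed = json.loads(text)
    version = str(feed.get("version", ""))
    if not version.startswith("https://jsonfeed.org/version/"):
        raise ValueError("not a JSON Feed document")
    items = []
    for item in feed.get("items", []):
        items.append({
            "id": item["id"],
            "url": item.get("url"),
            "title": item.get("title"),
            # An item carries plain text, HTML, or both; prefer plain text.
            "content": item.get("content_text") or item.get("content_html"),
        })
    return items


entries = read_json_feed(SAMPLE_FEED)
for entry in entries:
    print(entry["id"], entry["url"], entry["content"])
```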
FastImage
FastImage finds the size or type of an image given its URI by fetching as little of the file as needed.
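The idea of reading only as much of the file as needed can be sketched in Python for the PNG case. This is not FastImage's own code (FastImage is a Ruby gem); `png_size` and `fetch_prefix` are hypothetical helpers. It relies on the PNG layout, where the width and height sit in the IHDR chunk within the first 24 bytes of the file.

```python
import struct
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header):
    """Return (width, height) from the first 24 bytes of a PNG, else None."""
    if len(header) < 24:
        return None
    # Bytes 0-7: signature; 8-11: chunk length; 12-15: chunk type "IHDR".
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Bytes 16-23: width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def fetch_prefix(url, length=24):
    """Download only the first `length` bytes of a URL (not called below)."""
    request = urllib.request.Request(
        url, headers={"Range": "bytes=0-%d" % (length - 1)}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        # read(length) also caps the download if the server ignores Range.
        return response.read(length)


# Build a synthetic 640x480 PNG header so the sketch runs offline.
sample_header = (
    PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
)
print(png_size(sample_header))
```

Other formats keep their dimensions at different offsets (JPEG in particular may need more than a fixed-size prefix), so a full implementation would dispatch on the file signature first.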
Wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
MetaInspector
Ruby gem for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, links, images, and more.
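As a rough Python illustration of this kind of page summary (not MetaInspector's API, which is Ruby), the sketch below pulls a title, meta description, and links out of an HTML string using the standard library's `html.parser`. The `PageSummary` class, the `summarize` helper, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, meta description, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only kept while inside the <title> element.
        if self._in_title:
            self.title += data


def summarize(html):
    """Parse an HTML string and return the populated PageSummary."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><head>
  <title>Example Page</title>
  <meta name="description" content="A page used for illustration.">
</head><body>
  <a href="/about">About</a>
  <a href="https://example.org/contact">Contact</a>
</body></html>
"""

summary = summarize(SAMPLE_PAGE)
print(summary.title, summary.description, summary.links)
```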
Web Crawling related posts
- Launching Crawlee Blog: Your Node.js resource hub for web scraping and automation.
- How to scrape a website with Python (Beginner tutorial)
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Seven Python Projects to Elevate Your Coding Skills
- What is SERP? Meaning, Use Cases and Approaches
- Anything like scrapy in other languages?
- RSS can be used to distribute all sorts of information
A note from our sponsor - WorkOS
workos.com | 29 Mar 2024
Index
What are some of the best open-source Web Crawling projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 50,603 |
2 | pyspider | 16,273 |
3 | requests-html | 13,553 |
4 | crawlee | 11,796 |
5 | webmagic | 11,203 |
6 | jsoup | 10,565 |
7 | portia | 9,142 |
8 | MechanicalSoup | 4,533 |
9 | Crawler4j | 4,469 |
10 | Mechanize | 4,349 |
11 | RoboBrowser | 3,689 |
12 | Apache Nutch | 2,792 |
13 | Grab | 2,350 |
14 | gain | 2,031 |
15 | Scrapely | 1,845 |
16 | feedparser | 1,813 |
17 | PSpider | 1,811 |
18 | anemone | 1,617 |
19 | Upton | 1,614 |
20 | cola | 1,485 |
21 | FastImage | 1,351 |
22 | Wombat | 1,301 |
23 | MetaInspector | 1,025 |