Top 23 Crawling Open-Source Projects
-
newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
-
crawlee
Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
-
hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
-
cariddi
Takes a list of domains, crawls URLs, and scans for endpoints, secrets, API keys, file extensions, tokens, and more
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the Go Colly package.
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
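The core of that extraction step is selecting elements and pulling out their text. As a rough, dependency-free illustration of what Cheerio-style selectors do under the hood, here is a minimal sketch using only the Python standard library; the HTML snippet, the `productTitle` id, and the `price` class are invented for this example and do not reflect Amazon's actual markup:

```python
from html.parser import HTMLParser

# Walk the HTML, remember which element we are currently inside, and
# collect the text of the elements we care about. The markup and
# attribute names below are invented for illustration only.
class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field name we are collecting text for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "productTitle":
            self._current = "title"
        elif "price" in attrs.get("class", ""):
            self._current = "price"

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

html = """
<html><body>
  <span id="productTitle">Example Widget, Pack of 2</span>
  <span class="price">$19.99</span>
</body></html>
"""

parser = ProductParser()
parser.feed(html)
print(parser.fields)  # {'title': 'Example Widget, Pack of 2', 'price': '$19.99'}
```

Real libraries add CSS-selector syntax, request scheduling, and blocking countermeasures on top of this basic walk-and-collect loop.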
I have tried the following. 1. Log in to Okta via the browser programmatically using go-rod. I managed to do this successfully, but I'm failing to load Slack, as it gets stuck on Slack's browser loader screen. 2. I tried to authenticate via the Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then complete MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify, which returns a sessionToken. From there, I can successfully create a sessionCookie, which has proven quite useless to me. Perhaps I am doing it wrong.
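For reference, the two REST calls described in that post can be sketched offline. The endpoints come from the post itself; the payload field names (`username`, `password`, `stateToken`, `passCode`) are assumptions based on Okta's documented primary-authentication flow and may differ per org. The requests are constructed but deliberately not sent:

```python
import json
import urllib.request

JSON_HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}

def authn_request(domain: str, username: str, password: str) -> urllib.request.Request:
    # Primary authentication: POST {{domain}}/api/v1/authn.
    # Field names are assumptions based on Okta's documented API.
    return urllib.request.Request(
        f"https://{domain}/api/v1/authn",
        data=json.dumps({"username": username, "password": password}).encode(),
        headers=JSON_HEADERS,
        method="POST",
    )

def verify_request(domain: str, factor_id: str,
                   state_token: str, pass_code: str) -> urllib.request.Request:
    # MFA verification: POST {{domain}}/api/v1/authn/factors/{{factorID}}/verify.
    # A successful response contains the sessionToken mentioned above.
    return urllib.request.Request(
        f"https://{domain}/api/v1/authn/factors/{factor_id}/verify",
        data=json.dumps({"stateToken": state_token, "passCode": pass_code}).encode(),
        headers=JSON_HEADERS,
        method="POST",
    )

req = authn_request("example.okta.com", "user@example.com", "secret")
print(req.full_url)  # https://example.okta.com/api/v1/authn
```

To actually drive a browser session from the resulting sessionToken, it typically has to be exchanged for a session cookie via Okta's sessions or redirect endpoints before Slack's SSO flow will accept it.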
PuppeteerSharp
botasaurus: The All in One Framework to build Awesome Scrapers
Crawling related posts
- How To Scrape TikTok in 2024
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Modern automated data miner (scraper)
- An Introduction to the WARC File
- New webcrawler for bug-hunters and data-miners
- π "WebPalm: Unleash Websites" π
- Real challenge for web scrapers- Need Help!
-
A note from our sponsor - WorkOS
workos.com | 23 Apr 2024
Index
What are some of the best open-source Crawling projects? This list will help you:
# | Project | Stars
---|---|---
1 | Scrapy | 50,824 |
2 | colly | 22,120 |
3 | newspaper | 13,703 |
4 | crawlee | 12,044 |
5 | awesome-web-scraping | 6,299 |
6 | Ferret | 5,616 |
7 | rod | 4,750 |
8 | hakrawler | 4,225 |
9 | PuppeteerSharp | 3,149 |
10 | Apache Nutch | 2,807 |
11 | Grab | 2,353 |
12 | awesome-puppeteer | 2,318 |
13 | cariddi | 1,351 |
14 | core | 1,319 |
15 | mlscraper | 1,219 |
16 | botasaurus | 870 |
17 | Crawly | 839 |
18 | scrapyrt | 816 |
19 | Dataflow kit | 636 |
20 | isp-data-pollution | 566 |
21 | spidermon | 510 |
22 | WarcDB | 383 |
23 | LinkedInDumper | 362 |