Top 23 Scraping Open-Source Projects
- crawlee
  A web scraping and browser automation library for Node.js for building reliable crawlers, in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Supports both headful and headless mode, with proxy rotation.
- undetected-chromedriver
  Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
- Data-science
  Collection of useful data science topics along with articles, videos, and code (by khuyentran1401)
- Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
  Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
- trafilatura
  Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the GoColly package.
In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.
This command-line tool clicks ads for a certain query on Google/Bing search using the undetected_chromedriver package. Supports proxy, running multiple simultaneous browsers, ad targeting/exclusion, and running in a loop.
Use a library with a WebDriver API (e.g. Symfony Panther if you want to stick to PHP, or Playwright) to run tests against your WP backend.
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
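To give a feel for that "extra code", here is a minimal sketch of just two of the listed pieces, URL de-duplication and a polite per-host delay, using only the Python standard library. The names `normalize_url` and `PoliteFrontier` are hypothetical and are not trafilatura's API; the normalization rules (drop the fragment, lowercase the host, strip a trailing slash) are an assumption about what counts as "the same URL".

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class PoliteFrontier:
    """Hypothetical URL queue: skips duplicates and spaces out hits per host."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay_seconds = delay_seconds
        self.seen = set()    # normalized URLs already queued
        self.queue = []      # normalized URLs in insertion order
        self.last_hit = {}   # host -> timestamp of the last request

    def add(self, url: str) -> bool:
        """Queue the URL unless an equivalent one was already added."""
        key = normalize_url(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append(key)
        return True

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before this URL's host may be requested again."""
        last = self.last_hit.get(urlsplit(url).netloc.lower())
        if last is None:
            return 0.0
        return max(0.0, self.delay_seconds - (now - last))

    def mark_fetched(self, url: str, now: float) -> None:
        """Record that the URL's host was requested at time `now`."""
        self.last_hit[urlsplit(url).netloc.lower()] = now
```

A crawler loop would call `add` for every discovered link and sleep for `wait_time(url, time.monotonic())` before each request. Sitemap and feed parsing, main-content heuristics, and language detection would each need comparable code on top of this.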
Project mention: Osint update of the Snoop Project tool search for user by nickname | news.ycombinator.com | 2024-01-02
Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13
As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
AFAIK Facebook has no such APIs, only good old HTML parsing. Check out this project, for example: https://github.com/kevinzg/facebook-scraper (most of the parsing code is here: https://github.com/kevinzg/facebook-scraper/blob/master/facebook_scraper/extractors.py )
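As a rough illustration of what "good old HTML parsing" involves, the sketch below pulls the text out of every `<div class="post">` element using Python's standard `html.parser`. It is not how facebook-scraper works; the `post` class name, `PostTextParser`, and `extract_posts` are hypothetical and chosen only for the example.

```python
from html.parser import HTMLParser


class PostTextParser(HTMLParser):
    """Collects the text inside every <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0    # <div> nesting depth inside the current post; 0 = outside
        self._chunks = []  # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A nested <div> inside a post: track it so we find the right close tag.
            self._depth += 1
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._chunks.append(data.strip())


def extract_posts(html: str) -> list[str]:
    """Return the text of each post-like <div> found in the given HTML."""
    parser = PostTextParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Real pages change their markup often, so selectors like the class name above tend to need regular maintenance; that upkeep is the main cost of scraping without an API.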
Scraping related posts
- 2024-03-01 listening in on the neighborhood
- Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
- Connect OpenAI with external APIs with Function calling
- A command-line utility for taking automated screenshots of websites
- Direction Of The Stock Market
- Show HN: Flyscrape – A standalone and scriptable web scraper in Go
- What are you going to do the day wi-fi/data shuts off?
Index
What are some of the best open-source Scraping projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 50,824 |
2 | colly | 22,120 |
3 | requests-html | 13,575 |
4 | crawlee | 12,044 |
5 | webmagic | 11,232 |
6 | undetected-chromedriver | 8,018 |
7 | tabula | 6,511 |
8 | awesome-web-scraping | 6,308 |
9 | autoscraper | 5,937 |
10 | Ferret | 5,616 |
11 | Mechanize | 4,354 |
12 | Data-science | 3,950 |
13 | fake-useragent | 3,459 |
14 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 3,048 |
15 | Symfony Panther | 2,882 |
16 | trafilatura | 2,740 |
17 | snoop | 2,683 |
18 | Geziyor | 2,475 |
19 | Grab | 2,354 |
20 | awesome-puppeteer | 2,318 |
21 | facebook-scraper | 2,177 |
22 | DiDOM | 2,173 |
23 | headless-chromium-php | 2,148 |