SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Scraper Open-Source Projects
-
Project mention: Create agents that monitor and act on your behalf | news.ycombinator.com | 2024-03-24
-
Libraries used: Axios, a promise-based HTTP client for making requests to the specified URL. Cheerio, a library for parsing and manipulating HTML, used here to extract data from the downloaded HTML content. Apify SDK, a library for building Apify Actors, used to initialize the Actor environment, read input data, and push extracted data to the dataset.
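The stack quoted above is JavaScript. As a language-neutral illustration of the same parse-and-collect step, here is a minimal Python sketch that uses only the standard library. The sample HTML, the choice of `<h2>` elements, and the names `HeadingExtractor` and `extract_headings` are assumptions made for this example. It is not the Cheerio or Apify SDK API, and the download step is left out so the sketch runs offline.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text of every <h2> element (a stand-in for a CSS selector)."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current heading.
        if self._in_heading:
            self.headings[-1] += data


def extract_headings(html):
    """Return the stripped text of each <h2> in the given HTML string."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# Hypothetical page content standing in for the downloaded HTML.
SAMPLE_HTML = (
    "<html><body><h2>First post</h2><p>intro</p>"
    "<h2> Second post </h2></body></html>"
)

# One record per extracted item, similar in spirit to pushing rows to a dataset.
dataset = [{"title": title} for title in extract_headings(SAMPLE_HTML)]
print(dataset)
```

In a real scraper the HTML string would come from an HTTP client, and the records would go to whatever storage the platform provides.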
-
Not a fix, but I tend to use lux when downloading from bilibili. It is faster too.
-
SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the GoColly package.
-
newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Project mention: Automating Data Collection with Apify: From Script to Deployment | dev.to | 2024-03-17
Previously, the Apify SDK offered a blend of crawling functionalities and Actor building features. However, a recent update separated these functionalities into two distinct libraries: Crawlee and Apify SDK v3. Crawlee now houses the web scraping and crawling tools, while Apify SDK v3 focuses solely on features specific to building Actors for the Apify platform. This distinction allows for a clear separation of concerns and enhances the development experience for various use cases.
-
Project mention: More than 400 start.me OSINT websites! More than 10KB of sources! | /r/OSINT | 2023-04-11
-
Project mention: What are the best tools for web scraping and analysis of natural language to populate a dataset? | /r/datasets | 2023-04-12
See if something like autoscraper or mlscraper suits your needs.
-
Douyin_TikTok_Download_API
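🚀 "Douyin_TikTok_Download_API" is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.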
I am currently working on a web scraper for TikTok. So far, I can retrieve data about the first 16 videos from a channel, which I achieved by making requests to an unofficial API, https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that extracts data from TikTok. How should I go about this task? I already tried getting the data from the HTML, but that is not sufficient, since most of it is missing from the response when I use requests.get(URL). Could you please recommend some repositories that could help, or some other way of extracting the data? Thank you!
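A plain `requests.get(URL)` returns only the initial HTML, before any JavaScript runs. Some sites ship their initial data as JSON inside a `<script>` tag in that HTML, in which case it can be read without a browser. The sketch below shows that general technique on a made-up page. The element id `__PAGE_STATE__`, the `videos` field, and the function name `extract_embedded_state` are all hypothetical, and nothing in the post confirms that TikTok exposes its data this way.

```python
import json
import re

# Hypothetical element id; real sites use their own names, if they embed state at all.
STATE_SCRIPT_ID = "__PAGE_STATE__"


def extract_embedded_state(html, script_id=STATE_SCRIPT_ID):
    """Return the JSON object embedded in <script id="..."> or None if absent."""
    pattern = re.compile(
        r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up HTML standing in for the body returned by requests.get(URL).
SAMPLE_HTML = (
    '<html><body><div id="app"></div>'
    '<script id="__PAGE_STATE__" type="application/json">'
    '{"videos": [{"id": "1", "desc": "first"}, {"id": "2", "desc": "second"}]}'
    "</script></body></html>"
)

state = extract_embedded_state(SAMPLE_HTML)
video_ids = [video["id"] for video in state["videos"]]
print(video_ids)
```

If the function returns `None`, the data is probably loaded by later network requests, and a browser automation tool is the usual fallback.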
-
-
I have tried the following. 1. Logging in to Okta through the browser programmatically using go-rod. I managed to do this successfully, but I fail to load Slack, as it gets stuck on Slack's browser loader screen. 2. Authenticating via the Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then complete MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify, which returns a sessionToken. From there, I can successfully create a sessionCookie, which has proven quite useless to me. Perhaps I am doing it wrong.
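The two REST calls described in the post can be sketched as follows. This is a minimal illustration in Python, not the poster's code: the endpoint paths come from the post, while the request body fields (`username`, `password`, `stateToken`, `passCode`) reflect my understanding of Okta's Authentication API and should be checked against the official reference. The function names and the example domain are hypothetical.

```python
import json
import urllib.request

JSON_HEADERS = {"Content-Type": "application/json", "Accept": "application/json"}


def build_authn_request(domain, username, password):
    """Step 1: primary authentication against {domain}/api/v1/authn."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{domain}/api/v1/authn", data=body, headers=JSON_HEADERS, method="POST"
    )


def build_verify_request(domain, factor_id, state_token, pass_code):
    """Step 2: MFA verification; state_token is assumed to come from step 1."""
    body = json.dumps({"stateToken": state_token, "passCode": pass_code}).encode("utf-8")
    return urllib.request.Request(
        f"{domain}/api/v1/authn/factors/{factor_id}/verify",
        data=body,
        headers=JSON_HEADERS,
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values; nothing is sent until send() is called.
authn = build_authn_request("https://example.okta.com", "user@example.com", "secret")
verify = build_verify_request("https://example.okta.com", "factor123", "state-abc", "123456")
print(authn.full_url)
print(verify.full_url)
```

Keeping request construction separate from `send` makes it possible to inspect the exact URL and payload of each step before anything goes over the wire.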
-
Project mention: GitHub - madawei2699/myGPTReader: myGPTReader is a bot on Slack that can read and summarize any webpage, documents including ebooks, or even videos from YouTube. It can communicate with you through voice. (a Python project) | /r/Python | 2023-03-30
-
ytdl-core for streaming from YouTube
-
Here's what I'm trying to use: https://github.com/JustAnotherArchivist/snscrape
What do I need to open/run any of this? My goal with this is to extract my follower list off Twitter, and I'd very much like to know how to run it on my machine instead of having someone run it for me on theirs. I can't even figure out what I need to open the Readme file.
-
https://github.com/IonicaBizau/scrape-it for simple code to target dom elements with classes/ids etc...
-
browser-fingerprinting
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?
Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13
Yes, there is a lot written about it. Here is one link I have saved:
-
-
Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE
Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
-
Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13
As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
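Because the product data shares one format across stores, a single normalizer can cover many sites. The Python sketch below assumes a store returns a JSON listing with a top-level `products` array (many Shopify stores serve one at `/products.json`, though that is not stated in the comment above). The field names in the sample payload, the store URL, and the function name `normalize_products` are assumptions for illustration, and the payload is inlined so the example runs offline.

```python
import json

# Made-up payload in the assumed shape of a store's product listing.
SAMPLE_PAYLOAD = """
{"products": [
  {"title": "Canvas Tote", "handle": "canvas-tote", "vendor": "Acme",
   "variants": [{"price": "24.00"}, {"price": "19.50"}]},
  {"title": "Enamel Mug", "handle": "enamel-mug", "vendor": "Acme",
   "variants": [{"price": "12.00"}]}
]}
"""


def normalize_products(payload, store_url):
    """Flatten a product listing into rows with a title, URL, and lowest price."""
    data = json.loads(payload)
    rows = []
    for product in data.get("products", []):
        prices = [
            float(variant["price"])
            for variant in product.get("variants", [])
            if "price" in variant
        ]
        rows.append(
            {
                "title": product["title"],
                "url": f"{store_url}/products/{product['handle']}",
                "min_price": min(prices) if prices else None,
            }
        )
    return rows


rows = normalize_products(SAMPLE_PAYLOAD, "https://shop.example")
for row in rows:
    print(row)
```

The same function can be reused for every store that follows the assumed format, which is what makes bulk collection of this kind straightforward.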
-
freeDictionaryAPI
There was no free Dictionary API on the web when I wanted one for my friend, so I created one.
I see that the request to dictionaryapi.dev comes directly from the client. Looking at the github for that project, I see this:
-
Scraper related posts
- Create agents that monitor and act on your behalf
- Show HN: Fitter – configurable open-source scraper
- Scraping the full snippet from Google search result
- Help regarding workflow
- Can someone walk me through this?
- What’s the coolest things you’ve done with python?
- Show HN: Flyscrape – A standalone and scriptable web scraper in Go
-
A note from our sponsor - SaaSHub
www.saashub.com | 28 Mar 2024
Index
What are some of the best open-source Scraper projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Huginn | 40,891 |
2 | cheerio | 27,609 |
3 | lux | 23,959 |
4 | colly | 21,939 |
5 | newspaper | 13,625 |
6 | crawlee | 11,743 |
7 | chinese-xinhua | 10,605 |
8 | awesome-crawler | 6,023 |
9 | autoscraper | 5,868 |
10 | Douyin_TikTok_Download_API | 5,785 |
11 | Ferret | 5,599 |
12 | rod | 4,650 |
13 | myGPTReader | 4,363 |
14 | node-ytdl-core | 4,232 |
15 | snscrape | 4,185 |
16 | scrape-it | 3,975 |
17 | browser-fingerprinting | 3,830 |
18 | Emby.Plugins.JavScraper | 3,091 |
19 | Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE | 3,026 |
20 | Geziyor | 2,464 |
21 | freeDictionaryAPI | 2,296 |
22 | google-play-scraper | 2,192 |
23 | bulk-downloader-for-reddit | 2,181 |