Top 23 Python Scrapy Projects
Redis-based components for Scrapy. Project mention: Ask HN: What are the best tools for web scraping in 2022? | news.ycombinator.com | 2022-08-10
With some work, you can use Scrapy for distributed projects that scrape thousands (even millions) of domains. We are using https://github.com/rmax/scrapy-redis.
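As a sketch of what that setup involves, the scrapy-redis README has projects point Scrapy's scheduler and duplicate filter at a shared Redis instance; a minimal settings fragment (the Redis URL is a placeholder) looks roughly like this:

```python
# settings.py -- minimal scrapy-redis wiring, per the project's README
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # requests queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedupe shared by all workers
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://localhost:6379"                        # placeholder; point at your Redis
```

Every worker started with these settings pulls from the same queue, which is what makes the multi-machine crawl described above possible.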
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
I have also modified the settings.py from according to steps 1-5 from https://github.com/scrapy-plugins/scrapy-splash
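For reference, steps 1-5 of that README amount to a settings fragment along these lines (the Splash URL is a placeholder for wherever your Splash container runs):

```python
# settings.py -- scrapy-splash wiring described in the README's steps 1-5
SPLASH_URL = "http://localhost:8050"  # placeholder; your running Splash instance

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```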
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, auto packaging, timer tasks, monitor & alert, and mobile UI. Project mention: Best scrapydweb fork | reddit.com/r/scrapy | 2022-11-16
It seems like there are a lot of more recently updated forks: https://github.com/my8100/scrapydweb/network
Admin UI for Scrapy / open-source Scrapinghub
Webscraping Open Project
The Web Scraping Open Project repository aims to share knowledge and experiences about web scraping with Python. Project mention: What are your thoughts on scrapy | reddit.com/r/webscraping | 2022-08-30
advertools: online marketing productivity and analysis tools. Project mention: Using Python to create large scale SEM campaigns in minutes. | reddit.com/r/Python | 2022-04-13
HTTP API for Scrapy spiders. Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19
Use multiple proxies with Scrapy. Project mention: Scrapy rotating proxies | reddit.com/r/webscraping | 2022-08-01
Hi, I've been using the scrapy-rotating-proxies library (https://github.com/TeamHG-Memex/scrapy-rotating-proxies) with Scrapy, and my crawl logs contain entries like "[rotating_proxies.expire] DEBUG: Proxy is DEAD." When I check and test the proxies (I'm using Webshare proxies) and the URLs mentioned in the logs individually, they work fine, so I assume it's a problem with the library. Has anyone had the same or a similar issue? (I looked for tickets reported on GitHub but didn't find any referring to this.)
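For context, the library's README configures rotation with a proxy list and two middlewares; the DEBUG lines quoted above come from its ban-detection logic marking a proxy dead. A minimal setup (proxy addresses are placeholders) looks like:

```python
# settings.py -- minimal scrapy-rotating-proxies setup, per its README
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",  # placeholders; substitute your own proxies
    "proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Note that BanDetectionMiddleware decides "DEAD" from response heuristics during the crawl, not from whether the proxy works in isolation, which may explain proxies that test fine individually.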
Random User-Agent middleware based on fake-useragent. Project mention: Looking for suggestions for a web scraper | reddit.com/r/learnpython | 2022-09-01
User-Agents: your user-agent list is pretty small, and you aren't adding the other headers that real browsers typically send. For a bigger list of user-agents, you could use the scrapy-fake-useragent middleware.
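As an illustration, the scrapy-fake-useragent README enables random user agents by swapping out Scrapy's stock middlewares:

```python
# settings.py -- scrapy-fake-useragent wiring, per its README
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user-agent and retry middlewares...
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # ...and replace them with randomizing equivalents
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
}
```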
🎭 Playwright integration for Scrapy. Project mention: Scrapy & splash guide | reddit.com/r/learnpython | 2023-02-18
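For anyone weighing it against Splash: per the scrapy-playwright README, the integration is enabled by swapping Scrapy's download handlers and reactor in settings:

```python
# settings.py -- route requests through Playwright, per the scrapy-playwright README
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in with `yield scrapy.Request(url, meta={"playwright": True})`, so only JavaScript-heavy pages pay the browser overhead.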
A set of spiders and scrapers to extract location information from places that post their location on the internet. Project mention: An aggressive, stealthy web spider operating from Microsoft IP space | news.ycombinator.com | 2023-01-18
I recently made some contributions to https://github.com/alltheplaces/alltheplaces which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are with allowing people the privilege of accessing their websites to find store locations, store information or to purchase an item.
Some examples of what I found you can't do at perhaps 20% of business websites:
* Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).
* Plan a holiday figuring out when a shop closes for the day in another country.
* Order something from an Internet connection in another country to arrive when you return home.
* Look up product information whilst on a work trip overseas.
* Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.
* Find the business on a price comparison website.
* Find the business on a search engine that isn't Google.
* Use the website if you had certain disabilities.
* Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).
The answer to me appears obvious: store and product information is for the most part static and can be kept in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. The JSON files can be advertised freely to the public, so that anyone trying to obtain information from the website could do so with a single request for a small file, causing as little impact as possible.
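The suggestion above can be sketched in a few lines: regenerate a static JSON file on the site's update cycle (the function and file names here are hypothetical) and let every scraper, price-comparison site, or OSM contributor fetch it in one small request.

```python
import json
import pathlib

def publish_store_data(stores, out_dir="public"):
    """Dump store/location data to a static JSON file that a scheduled
    job rewrites on the site's regular update cycle."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "stores.json"
    path.write_text(json.dumps(stores, indent=2))
    return path

# e.g. run nightly from cron: every consumer then needs only one small GET
stores = [{"id": 1, "name": "Example Store", "lat": 48.85, "lon": 2.35}]
publish_store_data(stores)
```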
estela, an elastic web scraping cluster 🕸. Project mention: How to run web scraping script every 15 minutes | reddit.com/r/webscraping | 2023-02-13
You may want to check out [estela](https://estela.bitmaker.la/docs/), a spider management solution developed by [Bitmaker](https://bitmaker.la) that lets you run [Scrapy](https://scrapy.org) spiders.
A Scrapy middleware to bypass Cloudflare's anti-bot protection
Scrape data from Goodreads using Scrapy and Selenium :books: Project mention: [OC] Top 10 Fantasy Books of the 21st Century According to GoodReads | reddit.com/r/dataisbeautiful | 2022-05-20
Used this code to get the data: https://github.com/havanagrawal/GoodreadsScraper
Scrapy middleware that allows crawling only new content
Parse government documents into well-formed JSON. Project mention: What are the best repos that are a display of clean code and good programming practices that I can learn from? | reddit.com/r/learnprogramming | 2022-09-06
I get feedback occasionally that this is the cleanest web scraping code someone’s seen: https://github.com/public-law/open-gov-crawlers
Scrapy MySQL pipeline. Project mention: Scraping data from one page into two separate database tables | reddit.com/r/scrapy | 2022-12-14
Ah sorry, I'm using this MySQL pipeline https://github.com/IaroslavR/scrapy-mysql-pipeline/ with item loaders.
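For anyone hitting the same two-table question, a hand-rolled alternative (this is not the linked library; the table names, credentials, and item fields are all hypothetical) is a pipeline that issues two INSERTs per item:

```python
# Sketch of a custom Scrapy item pipeline that writes one scraped item
# into two MySQL tables. Table/field names and credentials are placeholders.

def item_to_rows(item):
    """Split one scraped item into (sql, params) rows for two tables."""
    return [
        ("INSERT INTO products (name, price) VALUES (%s, %s)",
         (item["name"], item["price"])),
        ("INSERT INTO reviews (product_name, body) VALUES (%s, %s)",
         (item["name"], item["review"])),
    ]

class TwoTableMySQLPipeline:
    """Scrapy item pipeline: one process_item call, two INSERTs."""

    def open_spider(self, spider):
        import pymysql  # deferred so the pure logic above works without a DB
        self.conn = pymysql.connect(host="localhost", user="scrapy",
                                    password="secret", database="scraped")

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            for sql, params in item_to_rows(item):
                cur.execute(sql, params)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Registering the class in `ITEM_PIPELINES` is all Scrapy needs; keeping the row-splitting in a plain function makes that part testable without a database.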
Web crawler to collect Jeopardy! clues from https://j-archive.com
ScrapingAnt API client for Python.
Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box. Project mention: Free Python Scrapy 5-Part Mini Course | reddit.com/r/scrapy | 2022-10-20
Part 5: Deployment, Scheduling & Running Jobs - deploying our spider on a server, and monitoring and scheduling jobs via ScrapeOps.
Web crawler for Burplist, a search engine for craft beers in Singapore. Project mention: Say Goodbye to Heroku Free Tier: Here Are 4 Alternatives | dev.to | 2023-02-12
Heroku Dynos apps: Burplist was migrated to Koyeb. Having tried other notable Heroku alternatives like Fly.io, Northflank, and Railway, I can safely say the migration from Heroku to Koyeb required the least effort and had the fewest kinks. It just works in my case.
Scrapy middleware interface to scrape using ProxyCrawl proxy service
Python Scrapy related posts
Scrapy & splash guide
1 project | reddit.com/r/learnpython | 18 Feb 2023
An aggressive, stealthy web spider operating from Microsoft IP space
1 project | news.ycombinator.com | 18 Jan 2023
What does Overture Map mean for the future of OpenStreetMap
1 project | news.ycombinator.com | 24 Dec 2022
Best scrapydweb fork
1 project | reddit.com/r/scrapy | 16 Nov 2022
2 projects | news.ycombinator.com | 10 Nov 2022
Free Python Scrapy 5-Part Mini Course
1 project | reddit.com/r/scrapy | 20 Oct 2022
Implementing a Selenium backend on a web app?
1 project | reddit.com/r/webscraping | 8 Oct 2022
What are some of the best open-source Scrapy projects in Python? This list will help you:
6. Webscraping Open Project (1,199 stars)