Python Scrapy

Open-source Python projects categorized as Scrapy

Top 23 Python Scrapy Projects

  • scrapy-redis

    Redis-based components for Scrapy.

    Project mention: Ask HN: What are the best tools for web scraping in 2022? | news.ycombinator.com | 2022-08-10

    With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using https://github.com/rmax/scrapy-redis.
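    As a sketch of how scrapy-redis distributes a crawl, the scheduler queue and dupefilter are pointed at a shared Redis instance from `settings.py` (component paths per the scrapy-redis README; the Redis URL is a placeholder):

```python
# settings.py -- scrapy-redis wiring (Redis URL is a placeholder)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared seen-request set
SCHEDULER_PERSIST = True   # keep the queue between runs so crawls can resume
REDIS_URL = "redis://localhost:6379"
```

    With this in place, multiple spider processes on different machines pull from the same queue, which is what makes the thousands-of-domains setup above workable.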

  • Gerapy

    Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

  • scrapy-splash

    Scrapy+Splash for JavaScript integration

    Project mention: Scrape with Splash Requests returns empty | reddit.com/r/scrapy | 2022-06-17

    I have also modified settings.py according to steps 1-5 from https://github.com/scrapy-plugins/scrapy-splash
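    For reference, those README steps boil down to a `settings.py` block like this (Splash is assumed to be running locally on its default port):

```python
# settings.py -- scrapy-splash setup, following the project README
SPLASH_URL = "http://localhost:8050"  # address of the running Splash instance

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

    A common cause of empty responses even after this setup is that the spider is still yielding plain `scrapy.Request` objects instead of `scrapy_splash.SplashRequest`, so pages are fetched without JavaScript rendering.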

  • scrapydweb

    Web app for Scrapyd cluster management, Scrapy log analysis & visualization, auto packaging, timer tasks, monitoring & alerts, and a mobile UI.

    Project mention: Best scrapydweb fork | reddit.com/r/scrapy | 2022-11-16

    It seems like there are a lot of more recently updated forks: https://github.com/my8100/scrapydweb/network

  • SpiderKeeper

    Admin UI for Scrapy; an open-source alternative to Scrapinghub

  • Webscraping Open Project

    The web scraping open project repository aims to share knowledge and experiences about web scraping with Python

    Project mention: What are your thoughts on scrapy | reddit.com/r/webscraping | 2022-08-30
  • advertools

    advertools - online marketing productivity and analysis tools

    Project mention: Using Python to create large scale SEM campaigns in minutes. | reddit.com/r/Python | 2022-04-13
  • scrapyrt

    HTTP API for Scrapy spiders

    Project mention: New to python and scrapy stuff but need this project to work so that I can do my data research and stuff easily in the future. | reddit.com/r/scrapy | 2022-04-19
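    scrapyrt exposes each spider in a project behind a small HTTP API, so a crawl can be triggered with a single GET request instead of running Scrapy by hand. A minimal sketch of building such a request URL (the spider name and target URL here are hypothetical; scrapyrt listens on port 9080 by default):

```python
from urllib.parse import urlencode

# Hypothetical spider name and start URL
params = {"spider_name": "quotes", "url": "http://quotes.toscrape.com"}
request_url = "http://localhost:9080/crawl.json?" + urlencode(params)
# Fetching this URL (e.g. with requests.get) runs the spider and
# returns the scraped items as JSON in the response body.
```

    This is what makes scrapyrt convenient for "run my research scrape on demand" workflows: any HTTP client can kick off a crawl.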
  • scrapy-rotating-proxies

    use multiple proxies with Scrapy

    Project mention: Scrapy rotating proxies | reddit.com/r/webscraping | 2022-08-01

    Hi, I've been using the scrapy-rotating-proxies (https://github.com/TeamHG-Memex/scrapy-rotating-proxies) library with Scrapy, and I'm getting log entries in my crawl like "[rotating_proxies.expire] DEBUG: Proxy is DEAD". When I check and test the proxies (I'm using Webshare proxies) and the URLs mentioned in the logs individually, they work fine, so I assume it's a problem with the library. Has anyone had the same or a similar issue? (I looked for tickets reported on GitHub but didn't find any referring to this.)
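    For context, the library is enabled through two downloader middlewares plus a proxy list in `settings.py` (per its README; the proxy addresses below are placeholders). The "Proxy is DEAD" messages come from its ban-detection middleware, which can misclassify unusual but valid responses as bans:

```python
# settings.py -- scrapy-rotating-proxies setup (placeholder proxy addresses)
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

    When working proxies are flagged as dead, the ban-detection policy is usually the place to look, since it decides what counts as a banned response.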

  • scrapy-fake-useragent

    Random User-Agent middleware based on fake-useragent

    Project mention: Looking for suggestions for a web scraper | reddit.com/r/learnpython | 2022-09-01

    User-Agents: Your user-agent list is pretty small, and you aren't adding the other headers that real browsers typically send. For a bigger list of user agents you could use the scrapy-fake-useragent middleware.
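    A minimal sketch of wiring the middleware in, assuming the setting names from the scrapy-fake-useragent README: Scrapy's built-in user-agent middleware is disabled so the randomizing one takes over:

```python
# settings.py -- swap Scrapy's static User-Agent for a randomized one
DOWNLOADER_MIDDLEWARES = {
    # Disable the default middleware that sets a single fixed User-Agent
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Pick a fresh realistic User-Agent for each request
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
}
```

    Note this only randomizes the User-Agent header; matching the rest of a real browser's header set, as the comment above suggests, still has to be done separately.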

  • scrapy-playwright

    🎭 Playwright integration for Scrapy

    Project mention: Scrapy & splash guide | reddit.com/r/learnpython | 2023-02-18
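    As a sketch of the basic setup from the scrapy-playwright README: its download handlers and the asyncio reactor are registered in `settings.py`, and then individual requests opt into browser rendering via request meta:

```python
# settings.py -- route downloads through Playwright (per the project README)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider, a request opts in per-request:
#   yield scrapy.Request(url, meta={"playwright": True})
```

    Because only flagged requests go through the browser, static pages can still be fetched cheaply while JavaScript-heavy ones are rendered.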
  • alltheplaces

    A set of spiders and scrapers to extract location information from places that post their location on the internet.

    Project mention: An aggressive, stealthy web spider operating from Microsoft IP space | news.ycombinator.com | 2023-01-18

    I recently made some contributions to https://github.com/alltheplaces/alltheplaces which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are with allowing people the privilege of accessing their websites to find store locations, store information or to purchase an item.

    Some examples of what I found you can't do at perhaps 20% of business websites:

    * Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

    * Plan a holiday figuring out when a shop closes for the day in another country.

    * Order something from an Internet connection in another country to arrive when you return home.

    * Look up product information whilst on a work trip overseas.

    * Access the website via a workplace proxy server that is mandated to be used and happens to be hosted in a data centre (the likes of AWS and Azure included), which is blocked simply because it's not a residential ISP. A more complex variant has the website check that the time a request from client-side JavaScript takes to reach the server matches what the server expects it should take (measured by an ICMP ping or similar to the origin IP address).

    * Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

    * Find the business on a price comparison website.

    * Find the business on a search engine that isn't Google.

    * Access the website without first allowing obfuscated Javascript to execute (example: [1]).

    * Use the website if you had certain disabilities.

    * Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

    The answer to me appears obvious: Store and product information is for the most part static and can be stored in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. JSON files can be advertised freely to the public so if anyone was trying to obtain information off the website, they could do so with a single request for a small file causing as minimal impact as possible.

    Of course, none of the anti-bot measures implemented by websites actually stop the bots they most want to block. There are services specifically designed for scraping websites that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and it's trivial with software such as selenium-stealth to make each request look legitimate from a JavaScript perspective, as if it came from a Chrome browser running on Windows 10 with a single 2160p screen, and so on. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose, because the website will end up blocking legitimate customers and suffering extremely poor performance, worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client tests.

    [1] https://www.imperva.com/products/web-application-firewall-wa...

  • estela

    estela, an elastic web scraping cluster 🕸

    Project mention: How to run webs scraping script every 15 minutes | reddit.com/r/webscraping | 2023-02-13

    You may want to check out [estela](https://estela.bitmaker.la/docs/), which is a spider management solution, developed by [Bitmaker](https://bitmaker.la) that allows you to run [Scrapy](https://scrapy.org) spiders.

  • scrapy-cloudflare-middleware

    A Scrapy middleware to bypass Cloudflare's anti-bot protection

  • GoodreadsScraper

    Scrape data from Goodreads using Scrapy and Selenium :books:

    Project mention: [OC] Top 10 Fantasy Books of the 21st Century According to GoodReads | reddit.com/r/dataisbeautiful | 2022-05-20

    Used this code to get the data: https://github.com/havanagrawal/GoodreadsScraper

  • scrapy-crawl-once

    Scrapy middleware which allows crawling only new content
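    A sketch of its setup, following the scrapy-crawl-once README: the middleware is registered on both the spider and downloader sides, and requests opt in with a meta flag; fingerprints of completed requests are stored locally so later runs skip them:

```python
# settings.py -- scrapy-crawl-once registration (per the project README)
SPIDER_MIDDLEWARES = {
    "scrapy_crawl_once.CrawlOnceMiddleware": 100,
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_crawl_once.CrawlOnceMiddleware": 50,
}

# In a spider, mark requests that should only ever be crawled once:
#   yield scrapy.Request(url, meta={"crawl_once": True})
```

    This makes incremental crawls cheap: listing pages can be revisited every run while already-seen detail pages are skipped.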

  • open-gov-crawlers

    Parse government documents into well-formed JSON

    Project mention: What are the best repos that are a display of clean code and good programming practices that I can learn from? | reddit.com/r/learnprogramming | 2022-09-06

    I get feedback occasionally that this is the cleanest web scraping code someone’s seen: https://github.com/public-law/open-gov-crawlers

  • scrapy-mysql-pipeline

    Scrapy MySQL pipeline

    Project mention: Scraping data from one page into two separate database tables | reddit.com/r/scrapy | 2022-12-14

    Ah, sorry. I'm using this MySQL pipeline https://github.com/IaroslavR/scrapy-mysql-pipeline/ with item loaders
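    For reference, a Scrapy pipeline, this one included, is switched on through the standard `ITEM_PIPELINES` setting (the pipeline path below follows the scrapy-mysql-pipeline README; connection details live in separate project settings and are omitted here):

```python
# settings.py -- enable the MySQL pipeline for all yielded items
ITEM_PIPELINES = {
    "scrapy_mysql_pipeline.MySQLPipeline": 300,
}
```

    For the two-tables question above, the usual pattern is to yield two distinct item classes from the same parse callback and have the pipeline route each class to its own table.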

  • jarchive-clues

    Web crawler to collect Jeopardy! clues from https://j-archive.com

  • scrapingant-client-python

    ScrapingAnt API client for Python.

  • scrapeops-scrapy-sdk

    Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box.

    Project mention: Free Python Scrapy 5-Part Mini Course | reddit.com/r/scrapy | 2022-10-20

    Part 5: Deployment, Scheduling & Running Jobs - Deploying our spider on a server, and monitoring and scheduling jobs via ScrapeOps. Article

  • burplist

    Web crawler for Burplist, a search engine for craft beers in Singapore

    Project mention: Say Goodbye to Heroku Free Tier: Here Are 4 Alternatives | dev.to | 2023-02-12

    Heroku Dynos apps: Burplist was migrated to Koyeb. Having tried other notable Heroku alternatives like Fly.io, Northflank, and Railway, I can safely say that the migration from Heroku to Koyeb required the least effort and had the fewest kinks. It just works in my case.

  • scrapy-proxycrawl-middleware

    Scrapy middleware interface to scrape using ProxyCrawl proxy service

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-02-18.

Index

What are some of the best open-source Scrapy projects in Python? This list will help you:

Project Stars
1 scrapy-redis 5,248
2 Gerapy 2,918
3 scrapy-splash 2,863
4 scrapydweb 2,679
5 SpiderKeeper 2,632
6 Webscraping Open Project 1,199
7 advertools 784
8 scrapyrt 758
9 scrapy-rotating-proxies 638
10 scrapy-fake-useragent 628
11 scrapy-playwright 453
12 alltheplaces 381
13 estela 110
14 scrapy-cloudflare-middleware 90
15 GoodreadsScraper 81
16 scrapy-crawl-once 72
17 open-gov-crawlers 50
18 scrapy-mysql-pipeline 47
19 jarchive-clues 29
20 scrapingant-client-python 18
21 scrapeops-scrapy-sdk 18
22 burplist 11
23 scrapy-proxycrawl-middleware 10