Top 23 Python Spider Projects
InfoSpider
INFO-SPIDER is a crawler toolbox 🧰 that bundles many data sources in one place, designed to help users safely and quickly retrieve their own data; the code is open source and the process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Ali Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blog, OSChina blog, and Jianshu.
scrapydweb
Web app for Scrapyd cluster management, Scrapy log analysis & visualization, auto packaging, timer tasks, monitoring & alerting, and a mobile UI.
There seem to be a number of more recently updated forks: https://github.com/my8100/scrapydweb/network
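For context, scrapydweb is a management UI on top of Scrapyd's JSON API. A minimal sketch of scheduling a job directly against a Scrapyd server; the host, project, and spider names are placeholders:

```python
import requests

# Scrapyd exposes a simple JSON API; scrapydweb manages one or more
# Scrapyd servers through it. All names below are placeholders.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```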
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
You can use grab-site with `--no-offsite-links` and `--igsets=mediawiki`, for example:
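As a sketch, assuming grab-site is on your PATH and using a placeholder target URL, the equivalent invocation from Python would be:

```python
import subprocess

# Invoke grab-site with the flags mentioned above; the URL is a placeholder.
subprocess.run(
    [
        "grab-site",
        "--no-offsite-links",   # don't follow links to other domains
        "--igsets=mediawiki",   # apply the bundled MediaWiki ignore set
        "https://wiki.example.org/",
    ],
    check=True,
)
```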
freshonions-torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
alltheplaces
A set of spiders and scrapers to extract location information from places that post their location on the internet.
Project mention: An aggressive, stealthy web spider operating from Microsoft IP space | news.ycombinator.com | 2023-01-18
I recently made some contributions to https://github.com/alltheplaces/alltheplaces, which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now shocked at how hostile a lot of businesses are about letting people access their websites to find store locations, store information, or to purchase an item.
Some examples of what I found you can't do at perhaps 20% of business websites:
* Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).
* Plan a holiday figuring out when a shop closes for the day in another country.
* Order something from an Internet connection in another country to arrive when you return home.
* Look up product information whilst on a work trip overseas.
* Access the website via a mandatory workplace proxy server that happens to be hosted in a data centre (AWS and Azure included) and is blocked simply because it's not a residential ISP. A more complex check would be the website verifying that the round-trip time of a request made by JavaScript code executing on the client matches what the server expects it to be (measured by ICMP ping or similar to the origin IP address).
* Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.
* Find the business on a price comparison website.
* Find the business on a search engine that isn't Google.
* Access the website without first allowing obfuscated JavaScript to execute (example: [1]).
* Use the website if you had certain disabilities.
* Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).
The answer to me appears obvious: store and product information is for the most part static and can be kept in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. The JSON files can be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, causing as little impact as possible.
Of course, none of the anti-bot measures implemented by websites actually stop the bots they most want to block. There are services specifically designed for scraping that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and with software such as selenium-stealth it's trivial to make each request look legitimate from a JavaScript perspective, as if it came from a Chrome browser running on Windows 10 with a single 2160p screen, and so on. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose: it ends up blocking legitimate customers and suffering extremely poor performance, made worse by forcing bots to issue 10x the number of requests just to look legitimate or pass client tests.
[1] https://www.imperva.com/products/web-application-firewall-wa...
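To make the static-JSON suggestion concrete, here is a minimal sketch of an alltheplaces-style Scrapy spider consuming a hypothetical store-locator JSON endpoint; the URL and field names are made up for illustration:

```python
import scrapy

class ExampleStoresSpider(scrapy.Spider):
    """Hypothetical alltheplaces-style spider: one small JSON request
    yields every store location."""
    name = "example_stores"
    start_urls = ["https://www.example.com/stores.json"]  # made-up endpoint

    def parse(self, response):
        for store in response.json():  # assumes the endpoint returns a JSON array
            yield {
                "ref": store.get("id"),
                "name": store.get("name"),
                "lat": store.get("latitude"),
                "lon": store.get("longitude"),
                "addr_full": store.get("address"),
            }
```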
hacker-news-digest
Project mention: Show HN: Hacker News Summary – Let ChatGPT Summarize Hacker News for You | /r/patient_hackernews | 2023-06-09
estela
This solution uses requests, but you can also do it in Scrapy; and if you plan to run more crawlers, you can use estela, which is a spider management solution. A minimal requests version might look like the sketch below.
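The plain-requests approach amounts to something like this, with a placeholder URL and CSS selector, and assuming BeautifulSoup for parsing:

```python
import requests
from bs4 import BeautifulSoup  # parsing library; an assumption here

resp = requests.get("https://example.com/listing", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# The selector is made up; adapt it to the page you're scraping.
titles = [a.get_text(strip=True) for a in soup.select("a.title")]
print(titles)
```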
webspot
An intelligent web service to automatically detect web content and extract information from it.
telegram-groups-crawler
A Telegram crawler made in Python to automatically search groups and channels and collect any type of data from them (+ dataset included).
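The repository's own code is the reference, but as an illustration of the general approach, here is a minimal sketch using Telethon (an assumption; the project may use a different client library) to list the groups and channels an account can see:

```python
from telethon import TelegramClient

# Placeholder credentials, obtained from https://my.telegram.org
api_id = 12345
api_hash = "0123456789abcdef0123456789abcdef"

client = TelegramClient("session", api_id, api_hash)

async def main():
    # Iterate over all dialogs and keep only groups and channels.
    async for dialog in client.iter_dialogs():
        if dialog.is_group or dialog.is_channel:
            print(dialog.id, dialog.name)

with client:
    client.loop.run_until_complete(main())
```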
scrapeops-scrapy-sdk
Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the box.
Part 5 (Deployment, Scheduling & Running Jobs) covers deploying our spider on a server, and monitoring and scheduling jobs via ScrapeOps.
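Wiring the SDK into a Scrapy project is a settings change; a hedged sketch based on the ScrapeOps docs, where the extension path and key are assumptions to verify against the project README:

```python
# settings.py (sketch; verify paths against the scrapeops-scrapy README)
SCRAPEOPS_API_KEY = "YOUR_API_KEY"  # placeholder

EXTENSIONS = {
    # Assumed extension path for the ScrapeOps monitor.
    "scrapeops_scrapy.extension.ScrapeOpsMonitor": 500,
}
```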
Python Spider related posts
- struggling to download websites
- Internet Archive Down, will be up and running soon (i hope).
- best tool for downloading forum posts in real-time?
- Best way to back up entire website on a schedule
- An aggressive, stealthy web spider operating from Microsoft IP space
- What does Overture Map mean for the future of OpenStreetMap
- Datasette AllThePlaces
Index
What are some of the best open-source Spider projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Photon | 9,686 |
2 | InfoSpider | 6,663 |
3 | toapi | 3,388 |
4 | Gerapy | 2,993 |
5 | scrapydweb | 2,715 |
6 | SpiderKeeper | 2,644 |
7 | Grab | 2,287 |
8 | TorBot | 1,983 |
9 | PSpider | 1,771 |
10 | grab-site | 1,020 |
11 | XSRFProbe | 857 |
12 | freshonions-torscraper | 447 |
13 | alltheplaces | 394 |
14 | hacker-news-digest | 324 |
15 | LinkedInDumper | 171 |
16 | estela | 117 |
17 | webspot | 55 |
18 | telegram-groups-crawler | 50 |
19 | scrapeops-scrapy-sdk | 20 |
20 | XingDumper | 17 |
21 | amazon_price_tracker | 6 |
22 | scrapy-yle-kuntavaalit2021 | 2 |
23 | python-selenium | 2 |