Top 23 Python Crawler Projects
-
newspaper
newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
-
Douyin_TikTok_Download_API
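🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili. It supports API calls as well as online batch parsing and downloading.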
-
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
-
news-please
news-please - an integrated web crawler and information extractor for news that just works
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16
I am currently working on a web scraper for TikTok. So far, I am able to retrieve data about the first 16 videos from a channel. I achieved this by making requests to an unofficial API, https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that would extract data from TikTok. I would like to ask you all how I should go about this task. I already tried getting the data from the HTML, but that is not sufficient, since most of it is not present in the response when I use requests.get(URL). Could you please recommend some repositories that could help, or some other way of extracting the data? Thank you!
Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14
The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
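To give a sense of the extra code the comment above refers to, here is a minimal sketch of just two of the listed pieces, URL de-duplication and a robots.txt check, using only the Python standard library. The function names (`normalize_url`, `dedupe_urls`, `allowed_urls`) and the `example-bot` user agent are hypothetical and chosen for illustration; this is not how trafilatura implements these features.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    url, _fragment = urldefrag(url)  # drop "#section" fragments
    parts = urlsplit(url)
    path = parts.path or "/"  # treat an empty path and "/" as the same page
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe_urls(urls):
    """Return normalized URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def allowed_urls(urls, robots_txt: str, user_agent: str = "example-bot"):
    """Keep only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    # Parse robots.txt content supplied as text; no network request is made.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

A real crawler would also need rate limiting, sitemap and feed parsing, download queues, and content extraction on top of this, which is the point the comment makes.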
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29
The format you want is WARC. Even the Library of Congress uses it. There are many, many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.
botasaurus – The All in One Framework to build Awesome Scrapers
Project mention: Show HN: New AI Dataset Based on LibGen and Sci-Hub | news.ycombinator.com | 2023-09-08
Python Crawler related posts
- The Great GPT Firewall
- Looking for something like ArchiveBox but with the recursive functionality of HTTrack
- Show HN: New AI Dataset Based on LibGen and Sci-Hub
- How to make scrapy run multiple times on the same URLs?
- Web Scraper Multiparadigmático!
- struggling to download websites
-
GMaps-Crawler VS google-maps-scraper - a user suggested alternative
2 projects | 14 May 2023
Index
What are some of the best open-source Crawler projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | Scrapy | 50,824 |
2 | pyspider | 16,319 |
3 | newspaper | 13,720 |
4 | Photon | 10,501 |
5 | Douyin_TikTok_Download_API | 6,780 |
6 | autoscraper | 5,937 |
7 | scrapy-redis | 5,451 |
8 | myGPTReader | 4,375 |
9 | ProxyBroker | 3,714 |
10 | toapi | 3,462 |
11 | weibo-crawler | 3,045 |
12 | trafilatura | 2,740 |
13 | TorBot | 2,599 |
14 | Grab | 2,354 |
15 | news-please | 1,925 |
16 | PSpider | 1,811 |
17 | OpenWPM | 1,311 |
18 | grab-site | 1,260 |
19 | mlscraper | 1,225 |
20 | XSRFProbe | 915 |
21 | botasaurus | 899 |
22 | scrapyrt | 816 |
23 | bookcorpus | 776 |