Scaling-to-distributed-crawling Alternatives
Similar projects and alternatives to scaling-to-distributed-crawling
-
Redis
Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
-
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
-
newspaper
newspaper3k is a news, full-text, and article-metadata extraction library for Python 3.
-
PeARS-orchard
This is the development version of PeARS, the people's search engine. More compact but less robust than PeARS-lite. If you just want to use PeARS as a local indexer, use PeARS-lite instead.
-
storm-crawler
A scalable, mature and versatile web crawler based on Apache Storm
-
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
scaling-to-distributed-crawling reviews and mentions
-
DOs and DON'Ts of Web Scraping
We published a repository and a blog post about distributed crawling in Python. It is a bit more involved than what we have seen so far: it uses external software (Celery as an asynchronous task queue and Redis as the database).
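The pattern the post describes, a broker handing crawl tasks to independent workers while a shared set deduplicates URLs, can be sketched without the external software. The following is a minimal in-memory stand-in, using Python's `queue` and `threading` modules in place of Celery and Redis; all names and URLs are illustrative, not taken from the repository:

```python
import queue
import threading

# Stand-ins for the distributed pieces: the Queue plays the broker's
# role, worker threads play the Celery workers, and a plain set plays
# the Redis set used for deduplication.
task_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()
results = []

def crawl(url):
    # A real task would fetch and parse the page; here we just record it.
    results.append(url)

def worker():
    while True:
        url = task_queue.get()
        if url is None:  # poison pill shuts the worker down
            task_queue.task_done()
            break
        with seen_lock:
            if url in seen:  # skip URLs another worker already claimed
                task_queue.task_done()
                continue
            seen.add(url)
        crawl(url)
        task_queue.task_done()

# Enqueue seed URLs (including a duplicate), then run two workers.
for u in ["https://example.com/", "https://example.com/a", "https://example.com/"]:
    task_queue.put(u)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
task_queue.join()
for _ in threads:
    task_queue.put(None)
for t in threads:
    t.join()

print(sorted(results))  # each URL is crawled exactly once
```

Swapping the in-memory queue and set for a Celery broker and Redis structures keeps the same shape while letting workers run on separate machines.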
-
Mastering Web Scraping in Python: Scaling to Distributed Crawling
We will start to separate concerns before the project grows. We already have two files: tasks.py and main.py. We will create another two to host the crawler-related functions (crawler.py) and the database access (repo.py). Please look at the snippet below for the repo file; it is not complete, but you get the idea. There is a GitHub repository with the final content in case you want to check it.
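The repo snippet referenced above is not included in this excerpt. As a rough illustration of what a repo.py-style module with Redis-backed queue and deduplication helpers might look like, here is a sketch that substitutes a small in-memory fake for the Redis client so it runs without a server; every class, key, and function name here is hypothetical, not taken from the repository:

```python
# repo.py-style sketch: database access kept separate from crawler logic.
# A real implementation would use redis.Redis(); FakeRedis mimics only
# the four commands these helpers need (sadd, sismember, lpush, rpop).

class FakeRedis:
    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, value):
        added = value not in self.sets.setdefault(key, set())
        self.sets[key].add(value)
        return int(added)

    def sismember(self, key, value):
        return value in self.sets.get(key, set())

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)

    def rpop(self, key):
        items = self.lists.get(key, [])
        return items.pop() if items else None

r = FakeRedis()

def add_to_queue(url):
    # Only enqueue URLs that have not been visited yet.
    if not r.sismember("crawling:visited", url):
        r.lpush("crawling:queue", url)

def pop_from_queue():
    return r.rpop("crawling:queue")

def mark_visited(url):
    r.sadd("crawling:visited", url)
```

Keeping these helpers in one module means the crawler code never touches Redis directly, so the storage backend can be swapped or mocked in tests without changing crawler.py.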
Stats
ZenRows/scaling-to-distributed-crawling is an open-source project licensed under the MIT License, an OSI-approved license.
The primary programming language of scaling-to-distributed-crawling is HTML.