scaling-to-distributed-crawling vs Crawly
| | scaling-to-distributed-crawling | Crawly |
|---|---|---|
| Mentions | 5 | 2 |
| Stars | 36 | 840 |
| Growth | - | 3.2% |
| Activity | 0.0 | 6.6 |
| Last Commit | over 2 years ago | 5 days ago |
| Language | HTML | Elixir |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
scaling-to-distributed-crawling
- DOs and DON'Ts of Web Scraping
We published a repository and blog post about distributed crawling in Python. It is a bit more complicated than what we've seen so far: it relies on external software, using Celery as an asynchronous task queue and Redis as the database.
- Mastering Web Scraping in Python: Scaling to Distributed Crawling - ZenRows
- Mastering Web Scraping in Python: Scaling to Distributed Crawling
We will start to separate concerns before the project grows. We already have two files: tasks.py and main.py. We will create another two to host crawler-related functions (crawler.py) and database access (repo.py). Please look at the snippet below for the repo file; it is not complete, but it conveys the idea. There is a GitHub repository with the final code in case you want to check it.
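The repo.py snippet the excerpt refers to is not reproduced on this page. As a rough illustration of the idea only (hypothetical names; the post itself uses Redis, for which a plain in-memory set and list stand in below so the sketch is self-contained), a persistence module for a distributed crawler might look like:

```python
# Hypothetical sketch of a repo.py-style persistence layer for a crawler.
# The blog post uses Redis; an in-memory set/list stands in here so the
# example runs standalone. Swap in redis-py calls (SADD, LPUSH, RPOP)
# for a real distributed setup.

class Repo:
    """Tracks visited URLs and the queue of URLs still to crawl."""

    def __init__(self):
        self._visited = set()   # a Redis SET in the real version
        self._queue = []        # a Redis LIST used as a FIFO queue

    def add_to_queue(self, url: str) -> None:
        # Only enqueue URLs we have never seen, so workers don't repeat work.
        if url not in self._visited and url not in self._queue:
            self._queue.append(url)

    def next_url(self):
        # Pop the next pending URL, or None when the queue is drained.
        return self._queue.pop(0) if self._queue else None

    def mark_visited(self, url: str) -> None:
        self._visited.add(url)

    def visited_count(self) -> int:
        return len(self._visited)


repo = Repo()
repo.add_to_queue("https://example.com/")
repo.add_to_queue("https://example.com/about")
repo.add_to_queue("https://example.com/")  # duplicate, silently ignored

url = repo.next_url()
repo.mark_visited(url)
print(url, repo.visited_count())  # https://example.com/ 1
```

Keeping the visited set and the pending queue behind one module is what makes the later swap to Redis painless: the crawler code never touches the storage backend directly.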
Crawly
- Crawly – Elixir web scraping framework
- Squirm - This was the night of the crawling terror!
I would say that Squirm is meant for crawling purposes; it is a close analogue of https://github.com/elixir-crawly/crawly
What are some alternatives?
- celery - Distributed Task Queue (development branch)
- Crawler - A high-performance web crawler / scraper in Elixir.
- colly - Elegant Scraper and Crawler Framework for Golang
- scrape - Scrape any website, article or RSS/Atom feed with ease!
- Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
- gun - HTTP/1.1, HTTP/2, and WebSocket client (and more) for Erlang/OTP.
- Redis - Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes, Streams, HyperLogLogs, Bitmaps.
- finch - Elixir HTTP client, focused on performance
- newspaper - newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
- yuri - Elixir module for easier URI manipulation.
- PeARS-orchard - This is the development version of PeARS, the people's search engine. More compact but less robust than PeARS-lite. If you just want to use PeARS as a local indexer, use PeARS-lite instead.
- mint - Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱