Scrapy
| | Scrapy | RoboBrowser |
|---|---|---|
| Mentions | 198 | - |
| Stars | 62,224 | 3,695 |
| Growth | 1.3% | -0.1% |
| Activity | 9.5 | 0.0 |
| Last commit | 3 days ago | almost 6 years ago |
| Language | Python | Python |
| License | BSD 3-clause "New" or "Revised" License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Scrapy
- Why everyone is talking about loop-engineering and how is it changing agentic ai workflows? Claude Code and Web Scraping examples
Think about what a mature scraping project already contains. There is a schema that every item must validate against. There are field coverage thresholds, because a run where only 60% of products have prices is a failed run no matter what the exit code says. There are expected item counts, error rate ceilings, and finish reason checks. In the Scrapy world we even have a dedicated framework for all of this, and I wrote about it earlier this year in my post on giving spidey-senses to your spiders with Spidermon. Here is the reframe that I cannot stop thinking about: a Spidermon monitor suite is a rubric. Our community spent a decade encoding "what good data looks like" into machine-checkable criteria, because silent failure is scraping's oldest enemy, the spider that runs green for three weeks while quietly shipping garbage. We built the evaluator long before we had a generator capable of acting on its feedback. Every other field adopting loop engineering has to invent its definition of done from scratch. We just have to plug ours in. The missing piece was never detection. It was what happens after detection, which until now was a human reading an alert, opening the site, sighing at the redesign, and rewriting selectors. Models like Fable 5, which Anthropic says can work autonomously far longer than any previous Claude model, are finally good enough to sit inside that gap. John Rooney saw early versions of this pattern when he built scraping agents for 30 days, and the lesson that stuck with me from his series is that agents fail not from lack of capability but from lack of structure around them. Loops are that structure.
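The idea of a monitor suite acting as a rubric can be sketched in plain Python. This is a minimal illustration and not Spidermon's API: the `Rubric` class, the `evaluate_run` function, the thresholds, and the definition of error rate as logged errors per request are all assumptions made for the example. The stats keys (`finish_reason`, `log_count/ERROR`, `downloader/request_count`) are ones Scrapy's stats collector normally records.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Hypothetical thresholds; tune them per spider."""
    min_items: int = 100
    max_error_rate: float = 0.01
    # field name -> minimum fraction of items that must carry a value
    min_coverage: dict = field(default_factory=lambda: {"price": 0.95})
    expected_finish_reason: str = "finished"


def field_coverage(items, name):
    """Fraction of items where the field is present and non-empty."""
    if not items:
        return 0.0
    filled = sum(1 for item in items if item.get(name) not in (None, ""))
    return filled / len(items)


def evaluate_run(items, stats, rubric):
    """Return a list of failure messages; an empty list means the run passed."""
    failures = []

    if len(items) < rubric.min_items:
        failures.append(f"item count {len(items)} below minimum {rubric.min_items}")

    # Assumption: error rate is measured as logged errors per request sent.
    errors = stats.get("log_count/ERROR", 0)
    requests = stats.get("downloader/request_count", 0)
    error_rate = errors / requests if requests else 0.0
    if error_rate > rubric.max_error_rate:
        failures.append(f"error rate {error_rate:.1%} above ceiling {rubric.max_error_rate:.1%}")

    reason = stats.get("finish_reason")
    if reason != rubric.expected_finish_reason:
        failures.append(f"finish reason {reason!r}, expected {rubric.expected_finish_reason!r}")

    for name, minimum in rubric.min_coverage.items():
        coverage = field_coverage(items, name)
        if coverage < minimum:
            failures.append(f"field {name!r} coverage {coverage:.0%} below minimum {minimum:.0%}")

    return failures


# A run that exits cleanly but where only 6 of 10 products have a price.
items = [{"name": "a", "price": 9.99}] * 6 + [{"name": "b", "price": None}] * 4
stats = {"finish_reason": "finished", "log_count/ERROR": 0, "downloader/request_count": 10}
failures = evaluate_run(items, stats, Rubric(min_items=10))
print(failures)
```

The returned list of failures is the kind of machine-readable feedback a loop could hand back to an agent instead of paging a human.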
- How to write and publish a Python package to PyPI
This guide walks through the full process using uv, a fast, modern Python toolchain that replaces pip, virtualenv, pip-tools, twine, and build with a single tool. We will write a reusable Scrapy download handler, structure it as a proper Python package, test it, and publish it to PyPI.
- How to tell if a page uses JavaScript rendering (and what to do about it)
In Scrapy, Zyte API integrates via the scrapy-zyte-api package:
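The excerpt ends before its snippet. As a hedged sketch, and not necessarily what the linked post shows, enabling the package through its Scrapy add-on in `settings.py` typically looks like the fragment below. The API key value is a placeholder, and the add-on approach assumes a Scrapy version with add-on support (2.10 or later); check the scrapy-zyte-api documentation for the setup that matches your versions.

```python
# settings.py (fragment)

# Assumption: Scrapy 2.10+ so that the ADDONS setting is available.
ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}

# Placeholder value; replace with a real key or load it from the environment.
ZYTE_API_KEY = "YOUR_API_KEY"
```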
- How to Use rs-trafilatura with Scrapy
Scrapy is the standard Python framework for web scraping. It handles crawling, scheduling, and data pipelines. rs-trafilatura plugs into Scrapy as an item pipeline — your spider yields items with HTML, and the pipeline adds structured extraction results automatically.
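The pipeline pattern described above can be sketched as follows. This is an illustration of the item pipeline shape only: `extract_text` is a hypothetical stand-in built on the standard library, not the rs-trafilatura API, and the `html` and `text` field names are assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Hypothetical stand-in for a real extractor call."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


class ExtractionPipeline:
    """Item pipeline sketch: reads item["html"] and adds item["text"]."""

    def process_item(self, item, spider):
        html = item.get("html")
        if html:
            item["text"] = extract_text(html)
        return item


# Usage outside a crawl, with a plain dict as the item.
sample = {
    "url": "https://example.com/post",
    "html": "<html><body><script>var x = 1;</script><p>Hello</p><p>world</p></body></html>",
}
result = ExtractionPipeline().process_item(sample, spider=None)
print(result["text"])
```

In a real project the class would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.ExtractionPipeline": 300}`, where the module path is hypothetical.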
- Best Python Web Scraping Libraries 2026
Official Documentation: Scrapy Project
- Scrapy Middlewares: A Practical Guide for Beginners (With Real-World Examples)
User-Agent: Scrapy/2.11.0 (+https://scrapy.org)
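The header above is the default that Scrapy 2.11.0 sends. A downloader middleware that sets a different value when a request has none could be sketched like this. The class name, the example bot string, and the `FakeRequest` stand-in are assumptions for illustration; Scrapy already ships a built-in middleware that applies the `USER_AGENT` setting in a similar way.

```python
class CustomUserAgentMiddleware:
    """Downloader middleware sketch: set a User-Agent if the request has none."""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # Read the value from project settings; the fallback string is made up.
        return cls(crawler.settings.get("USER_AGENT", "ExampleBot/1.0"))

    def process_request(self, request, spider):
        # Do not overwrite a header that a spider set explicitly.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # returning None lets Scrapy continue processing the request


class FakeRequest:
    """Minimal stand-in for scrapy.Request, used only to run this sketch."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


middleware = CustomUserAgentMiddleware("ExampleBot/1.0 (+https://example.com/bot)")
plain = FakeRequest()
preset = FakeRequest({"User-Agent": "SpiderSpecific/2.0"})
middleware.process_request(plain, spider=None)
middleware.process_request(preset, spider=None)
print(plain.headers, preset.headers)
```

In a project, a middleware like this would be registered under the `DOWNLOADER_MIDDLEWARES` setting with a priority number.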
- Progress Updates on Contribution to Scrapy
Last week, I worked on code refactoring in Scrapy, which is an essential practice in larger and more complex projects. Refactoring not only improves code maintainability but also makes it easier for other contributors to understand and extend the project. This task was a good starting point for me to verify that I had the Scrapy project correctly set up locally, as refactoring should not break existing functionality.
- Contributing to Larger Open Source Project - Scrapy
In the past three months, I worked on various open source projects, including my own project Repo Context Packager, Math Worksheet Generator and Open Web Calendar. This month, I want to challenge myself to work on a larger and more widely used project - Scrapy, a Python module for web crawling.
- How I Block All 26M of Your Curl Requests
From what I have seen, it is hard to tell what "serious scrapers" use. They use many things. Some use this, some do not. This is what I have learned from reading webscraping on Reddit. Nobody says things like that out loud.
There are many tools; see the links below.
Personally, I think that running Selenium can be a bottleneck. It does not play nice: sometimes processes break, sometimes the system even needs a restart because things are blocked, it can be a memory hog, and so on. That is my experience.
To be able to scale, I think you have to have your own implementation. Serious scrapers dismiss people who use Selenium or its derivatives as noobs who will come back asking why page X does not work with their scraping setup.
https://github.com/lexiforest/curl_cffi
https://github.com/encode/httpx
https://github.com/scrapy/scrapy
https://github.com/apify/crawlee
- Scrapy needs to have sane defaults that do no harm
RoboBrowser
We haven't tracked posts mentioning RoboBrowser yet.
Tracking mentions began in Dec 2020.
What are some alternatives?
colly - Elegant Scraper and Crawler Framework for Golang
requests-html - Pythonic HTML Parsing for Humans™
feedparser - Parse feeds in Python
Grab - Web Scraping Framework
portia - Visual scraping for Scrapy
Pomp - Screen scraping and web crawling framework