-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
The main advantage (for now) is that the library has a single interface for both HTTP and headless browsers, with autoscaling bundled in. You can write your crawlers against the same base abstraction, and the framework takes care of the heavy lifting. Developers of scrapers shouldn't need to reinvent the wheel; they should be able to focus on the "business" logic of their scrapers. Having said that, if you've written your own crawling library, the motivation to use Crawlee might be lower, and that's fair enough.
Please note that this is the first release, and we'll keep adding many more features as we go, including anti-blocking, adaptive crawling, etc. To see where this might go, check https://github.com/apify/crawlee
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Hey all,
This is Jan, the founder of [Apify](https://apify.com/)—a full-stack web scraping platform. After the success of [Crawlee for JavaScript](https://github.com/apify/crawlee/) and the demand from the Python community, we're launching [Crawlee for Python](https://github.com/apify/crawlee-python) today!
The main features are:
- A unified programming interface for both HTTP (HTTPX with BeautifulSoup) & headless browser crawling (Playwright)
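To make the unified-interface idea concrete, here is a minimal stdlib sketch (NOT the actual Crawlee API; every name below is illustrative): the handler with the "business" logic stays identical whether pages come from a raw-HTTP backend or a headless-browser backend.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only -- not Crawlee's real classes. It shows the
# design the bullet above describes: one base abstraction, two backends.

@dataclass
class Page:
    url: str
    html: str

class BaseCrawler:
    def __init__(self, handler: Callable[[Page], None]):
        self.handler = handler
        self.results: list[Page] = []

    def fetch(self, url: str) -> str:
        raise NotImplementedError  # each backend supplies its own fetcher

    def run(self, urls: list[str]) -> None:
        for url in urls:
            page = Page(url, self.fetch(url))
            self.handler(page)        # same user code for every backend
            self.results.append(page)

class HttpCrawler(BaseCrawler):
    def fetch(self, url: str) -> str:
        # Stand-in for a plain HTTP client (HTTPX + BeautifulSoup in Crawlee).
        return f'<html>fetched {url} over raw HTTP</html>'

class BrowserCrawler(BaseCrawler):
    def fetch(self, url: str) -> str:
        # Stand-in for a headless browser (Playwright in Crawlee).
        return f'<html>rendered {url} in a headless browser</html>'

seen: list[str] = []

def handler(page: Page) -> None:
    seen.append(page.url)  # the scraper's "business" logic, backend-agnostic

HttpCrawler(handler).run(['https://example.com'])
BrowserCrawler(handler).run(['https://example.com'])
```

Switching backends is then a one-line change in the crawler's construction; the handler never needs to know how the HTML was obtained.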
-
> Crawlee isn’t any less configurable than Scrapy.
Oh, then I have obviously overlooked how one would be able to determine if a proxy has been blocked and evict it from the pool <https://github.com/rejoiceinhope/scrapy-proxy-pool/blob/b833...>. Or how to use an HTTP cache independent of the "browser" cache (e.g. to allow short-circuiting the actual request if I can prove it is not stale for my needs, which enables recrawls to fix logic bugs, or even downloading the actual request-response payloads for making better tests): https://docs.scrapy.org/en/2.11/topics/downloader-middleware...
Unless you meant what I said about "pip install -e && echo glhf", in which case, yes, it's a "simple matter of programming" into a framework that was not designed to be extended.
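For reference, the blocked-proxy eviction the linked middleware is about boils down to something like this minimal stdlib sketch (hypothetical names and thresholds; not the scrapy-proxy-pool or Crawlee API): count failures per proxy and drop a proxy once it looks banned.

```python
import random

class ProxyPool:
    """Sketch of blocked-proxy eviction. Thresholds and method names are
    illustrative, not taken from any real library."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures

    def pick(self):
        # Rotate randomly over the proxies still believed to be healthy.
        if not self.proxies:
            raise RuntimeError('proxy pool exhausted')
        return random.choice(self.proxies)

    def report(self, proxy, ok):
        # A success resets the counter; enough consecutive failures
        # (bans, CAPTCHAs, connection errors) evicts the proxy.
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.proxies.remove(proxy)

pool = ProxyPool(['http://p1:8000', 'http://p2:8000'])
for _ in range(3):
    pool.report('http://p1:8000', ok=False)  # p1 keeps failing
# p1 has now been evicted; only p2 remains in the pool
```

The caller reports each request's outcome after the fact, so the pool converges on working proxies without the scraper's own logic having to know why a proxy died.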
-
I have been running my project with Selenium for some time.
Now I am using Crawlee. Thanks! I will keep working on it to integrate it better into my project, but I can already tell it works flawlessly.
My project, with crawlee: https://github.com/rumca-js/Django-link-archive