Announcing Crawlee Python: Now you can use Python to build reliable web crawlers

This page summarizes the projects mentioned and recommended in the original post on dev.to

  1. crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    import asyncio

    from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


    async def main() -> None:
        # Create a crawler instance
        crawler = PlaywrightCrawler(
            # headless=False,
            # browser_type='firefox',
        )

        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            data = {
                "request_url": context.request.url,
                "page_url": context.page.url,
                "page_title": await context.page.title(),
                "page_content": (await context.page.content())[:10000],
            }
            await context.push_data(data)

        await crawler.run(["https://crawlee.dev"])


    if __name__ == "__main__":
        asyncio.run(main())

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Crawlee for Python is fully open source, and its codebase is available in the Crawlee Python GitHub repository.

  4. scrapy-playwright

    🎭 Playwright integration for Scrapy

    While libraries like Scrapy require installing additional middleware (i.e., scrapy-playwright), which still doesn't work on Windows, Crawlee for Python provides a unified interface for HTTP and headless-browser crawling.
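
The request handler in the crawlee example above (item 1) stores the request URL, the final page URL, the page title, and the first 10,000 characters of the page HTML. As a minimal sketch, that record-building step can be pulled out into a plain function that runs without a browser. The names `build_record`, `PageRecord`, and `MAX_CONTENT_CHARS` are hypothetical helpers for illustration and are not part of Crawlee.

```python
from typing import TypedDict

# Hypothetical constant mirroring the [:10000] slice used in the handler.
MAX_CONTENT_CHARS = 10_000


class PageRecord(TypedDict):
    request_url: str
    page_url: str
    page_title: str
    page_content: str


def build_record(
    request_url: str,
    page_url: str,
    title: str,
    html: str,
    max_chars: int = MAX_CONTENT_CHARS,
) -> PageRecord:
    """Assemble the dict the handler pushes, truncating the HTML to max_chars."""
    return {
        "request_url": request_url,
        "page_url": page_url,
        "page_title": title,
        "page_content": html[:max_chars],
    }


if __name__ == "__main__":
    record = build_record(
        "https://crawlee.dev",
        "https://crawlee.dev/",
        "Crawlee",
        "<html>" + "x" * 20_000 + "</html>",
    )
    # The stored content is capped at MAX_CONTENT_CHARS characters.
    print(len(record["page_content"]))
```

Inside a handler, the same function could presumably be called with `context.request.url`, `context.page.url`, and the awaited title and content before passing the result to `context.push_data`.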

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Web Scraping Dynamic Websites With Scrapy Playwright

    1 project | dev.to | 6 Mar 2024
  • Scrapy & splash guide

    1 project | /r/learnpython | 18 Feb 2023
  • which libraries/frameworks could be used for page interaction?

    2 projects | /r/webscraping | 6 Nov 2022
  • Implementing a Selenium backend on a web app?

    1 project | /r/webscraping | 8 Oct 2022
  • Scraping Dynamic Javascript Websites with Scrapy and Scrapy-playwright

    2 projects | dev.to | 14 Jun 2022
