-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create a crawler instance
    crawler = PlaywrightCrawler(
        # headless=False,
        # browser_type='firefox',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            "request_url": context.request.url,
            "page_url": context.page.url,
            "page_title": await context.page.title(),
            "page_content": (await context.page.content())[:10000],
        }
        await context.push_data(data)

    await crawler.run(["https://crawlee.dev"])


if __name__ == "__main__":
    asyncio.run(main())
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Crawlee for Python is fully open source, and its codebase is available in the Crawlee Python repository on GitHub.
-
While libraries like Scrapy require installing additional middleware (e.g., scrapy-playwright) for browser support, and that middleware still doesn't work on Windows, Crawlee for Python provides a unified interface for both HTTP and headless-browser crawling.
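To make the "unified interface" idea concrete, here is a minimal, standard-library-only sketch of the design. It is not Crawlee's actual API: `Page`, `Fetcher`, `FakeHttpFetcher`, `FakeBrowserFetcher`, `crawl`, and `extract_title` are all hypothetical names, and the two fetchers return canned HTML instead of touching the network. The point it illustrates is only that the handler is written once and does not change when the fetching backend does.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Protocol


@dataclass
class Page:
    """Backend-independent view of a fetched page (hypothetical type)."""
    url: str
    html: str


class Fetcher(Protocol):
    """Anything that can turn a URL into a Page: raw HTTP, a browser, etc."""
    async def fetch(self, url: str) -> Page: ...


class FakeHttpFetcher:
    """Stand-in for a plain HTTP client; returns canned HTML, no network."""
    async def fetch(self, url: str) -> Page:
        return Page(url=url, html="<title>static</title>")


class FakeBrowserFetcher:
    """Stand-in for a headless browser; pretends JavaScript has already run."""
    async def fetch(self, url: str) -> Page:
        return Page(url=url, html="<title>rendered</title>")


Handler = Callable[[Page], Awaitable[dict]]


async def crawl(fetcher: Fetcher, urls: list[str], handler: Handler) -> list[dict]:
    """Run the same handler over every URL, whichever backend fetches it."""
    results = []
    for url in urls:
        page = await fetcher.fetch(url)
        results.append(await handler(page))
    return results


async def extract_title(page: Page) -> dict:
    """Handler written once; it never inspects which backend produced the page."""
    start = page.html.find("<title>") + len("<title>")
    end = page.html.find("</title>")
    return {"url": page.url, "title": page.html[start:end]}


if __name__ == "__main__":
    urls = ["https://crawlee.dev"]
    # Swapping the backend is a one-argument change; the handler stays the same.
    print(asyncio.run(crawl(FakeHttpFetcher(), urls, extract_title)))
    print(asyncio.run(crawl(FakeBrowserFetcher(), urls, extract_title)))
```

In this sketch the backend is a constructor-style choice made by the caller, which mirrors the trade-off described above: no separate middleware layer is needed to move from HTTP fetching to browser rendering.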