-
Playwright
Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
For simple scraping where the content is fairly static, or when performance is critical, I will use linkedom to process pages.
https://github.com/WebReflection/linkedom
When the content is complex or involves clicking, Playwright is probably the best tool for the job.
https://github.com/microsoft/playwright
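The static-content path this comment describes (parse the HTML directly, no browser) can be sketched with Python's stdlib `html.parser`; the comment itself uses linkedom in JS, so treat this as an analogous sketch, and the HTML snippet is a hypothetical stand-in for a fetched page.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags in static HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical static page; in practice the HTML would come from an HTTP fetch.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # [('/a', 'First'), ('/b', 'Second')]
```

When the page needs clicks or JS rendering, this approach stops working and a real browser driver (Playwright) takes over.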
-
CodeRabbit
CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.
-
estela is an elastic web scraping cluster running on Kubernetes. It provides mechanisms to deploy, run and scale web scraping spiders via a REST API and a web interface.
It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals who are considering moving away from proprietary scraping clouds, or who are designing their on-premise scraping architecture, so that they don't needlessly reinvent the wheel and can benefit from the get-go from features such as built-in scalability and elasticity.
estela has been recently published as OSS under the MIT license:
https://github.com/bitmakerla/estela
More details about it can be found in the release blog post and the official documentation:
https://bitmaker.la/blog/2022/06/24/estela-oss-release.html
https://estela.bitmaker.la/docs/
estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.
All kinds of feedback and contributions are welcome!
Disclaimer: I'm part of the development team behind estela :-)
-
-
curl-impersonate[1] is a curl fork that I maintain and which lets you fetch sites while impersonating a browser. Unfortunately, the practice of TLS and HTTP fingerprinting of web clients has become extremely common over the past year or so, which means a regular curl request will often return some JS challenge instead of the real content. curl-impersonate helps with that.
[1] https://github.com/lwthiker/curl-impersonate
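A minimal sketch of driving curl-impersonate from Python: the project ships per-browser wrapper scripts, and the wrapper name used here (`curl_chrome116`) is an assumption based on its release naming, so check your installed version for the exact script.

```python
import shlex

def impersonate_cmd(url: str, wrapper: str = "curl_chrome116") -> list:
    """Build an argv for a curl-impersonate browser wrapper script.

    The wrapper name is an assumption; curl-impersonate releases ship
    several such scripts, one per impersonated browser version.
    """
    return [wrapper, "-s", "-L", url]

cmd = impersonate_cmd("https://example.com")
print(shlex.join(cmd))  # curl_chrome116 -s -L https://example.com

# Actually running it requires curl-impersonate on PATH, e.g.:
# subprocess.run(cmd, capture_output=True, text=True, check=True)
```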
-
The polite package for R is intended to be a friendly way of scraping content while respecting the site owner. "The three pillars of a polite session are seeking permission, taking slowly and never asking twice."
https://github.com/dmi3kno/polite
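Those three pillars map naturally onto Python's stdlib too: robots.txt for permission, a delay for taking slowly, and a cache for never asking twice. A minimal sketch (the class and the robots.txt content are hypothetical; polite itself is an R package):

```python
import time
import urllib.robotparser

class PoliteSession:
    """Minimal sketch of polite's three pillars: permission, delay, cache."""
    def __init__(self, robots_txt: str, user_agent: str = "*", delay: float = 1.0):
        self.rp = urllib.robotparser.RobotFileParser()
        self.rp.parse(robots_txt.splitlines())  # seek permission
        self.user_agent = user_agent
        self.delay = delay                       # take slowly
        self.cache = {}                          # never ask twice

    def allowed(self, url: str) -> bool:
        return self.rp.can_fetch(self.user_agent, url)

    def fetch(self, url: str, fetcher) -> str:
        if not self.allowed(url):
            raise PermissionError(f"robots.txt disallows {url}")
        if url not in self.cache:
            time.sleep(self.delay)
            self.cache[url] = fetcher(url)  # fetcher stands in for a real HTTP GET
        return self.cache[url]

robots = "User-agent: *\nDisallow: /private/"
session = PoliteSession(robots, delay=0.01)
print(session.allowed("https://example.com/public/page"))   # True
print(session.allowed("https://example.com/private/page"))  # False
```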
-
For a particular type of scraping, we wrote SSScraper on top of Colly and it works really well:
https://github.com/gotripod/ssscraper/
-
Beautiful Soup gets the job done. I've made several apps using it.
[1] https://github.com/altilunium/wistalk (Scrapes Wikipedia to analyze a user's activity)
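The "analyze a user's activity" part of a tool like that boils down to counting edits per article from Wikipedia's contributions data. A sketch of that step, where the JSON payload is a simplified, hypothetical stand-in for a MediaWiki API response (in the real app it would arrive over HTTP):

```python
import json
from collections import Counter

# Simplified stand-in for a MediaWiki user-contributions response; the real
# payload shape may differ and would be fetched over HTTP.
payload = json.loads("""
{"query": {"usercontribs": [
    {"title": "Python (programming language)", "timestamp": "2023-01-02T10:00:00Z"},
    {"title": "Web scraping", "timestamp": "2023-01-03T11:00:00Z"},
    {"title": "Web scraping", "timestamp": "2023-01-04T12:00:00Z"}
]}}
""")

# Count how many edits the user made to each article.
edits_per_article = Counter(c["title"] for c in payload["query"]["usercontribs"])
print(edits_per_article.most_common(1))  # [('Web scraping', 2)]
```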
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
[2] https://github.com/altilunium/psedex (Scrapes a government website to get a list of all registered online services in Indonesia)
-
-
[4] https://github.com/altilunium/wi-page (Scrapes Wikipedia to get the most active contributors to a given article)
-
-
It depends. For a no-code solution, check [powerpage-web-crawler](https://github.com/casualwriter/powerpage-web-crawler) for crawling blogs/posts.
-
undetected-chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
-
-
browserless
Deploy headless browsers in Docker. Run on our cloud or bring your own. Free for non-commercial uses.
-
If the content you need is static, I like using Node + cheerio [0], as the selector syntax is quite powerful. If there is some JavaScript execution involved, however, I will fall back to puppeteer.
[0] - https://cheerio.js.org/
-
Unpopular opinion, but Bash/shell scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create scrapers by just wiring up these utils.
The only thing I wish were present is better support for regexes. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs. line-by-line.
I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: the ease of use of Bash utils, and a saner syntax.
[1]: https://github.com/ericchiang/pup
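The curl → pup → jq pipeline is just stages passing text along, and the same wiring-up style carries over to Python. A sketch where each stage mirrors one shell utility, with the HTML and data being hypothetical stand-ins (a real `fetch` would call cURL or urllib):

```python
import json
import re
from functools import reduce

# Each stage mirrors one shell utility: fetch (curl), extract (pup), transform (jq).
def fetch(_url):
    # Hypothetical static response standing in for a real HTTP fetch.
    return '<ul><li data-price="9.99">Widget</li><li data-price="4.50">Gadget</li></ul>'

def extract(html):
    # pup-style markup extraction, done here with a regex for brevity.
    return re.findall(r'data-price="([\d.]+)">([^<]+)<', html)

def transform(pairs):
    # jq-style reshaping into JSON records.
    return json.dumps([{"name": name, "price": float(price)} for price, name in pairs])

pipeline = [fetch, extract, transform]
result = reduce(lambda value, stage: stage(value), pipeline, "https://example.com/catalog")
print(result)  # [{"name": "Widget", "price": 9.99}, {"name": "Gadget", "price": 4.5}]
```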
-
-
I'm not sure about "best", but I've been using Colly (written in Go) and it's been pretty slick. Haven't run into anything it can't do.
http://go-colly.org/
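Colly's model is a crawl loop that visits each URL once and fires callbacks per page. A language-neutral sketch of that loop in Python, over a hypothetical in-memory link graph (Colly itself fetches real pages over HTTP and fires callbacks per matched element):

```python
from collections import deque

# Hypothetical in-memory site: page -> outgoing links.
site = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}

def crawl(start, on_visit):
    """Colly-style crawl loop: visit each page once, firing a callback per page."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        on_visit(page)
        for link in site[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)

visited = []
crawl("/", visited.append)
print(visited)  # ['/', '/about', '/blog', '/blog/post-1']
```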
-
My main qualms with bash as a scripting language are that its syntax is not only kind of bonkers (no judgement, I know it's an old tool) but also just crazily unsafe. I link to a few high-profile things whenever people ask me why my mantra is "the time to switch your script from bash to python is when you want to delete things".
>rm -rf /usr /lib/nvidia-current/xorg/xorg
https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...
>rm -rf "$STEAMROOT/"*
https://github.com/valvesoftware/steam-for-linux/issues/3671
It's just too easy to shoot yourself in the foot.
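The failure mode in both linked bugs is an empty or wrong variable turning `rm -rf "$VAR/"*` into a delete at the filesystem root. In Python the same mistake can be made loud instead of silent by validating the path before touching the filesystem; this guard function is a hypothetical sketch of that idea:

```python
from pathlib import Path

def guarded_rmtree_target(raw: str) -> Path:
    """Validate a directory path before any recursive delete.

    An empty variable in bash silently expands `rm -rf "$VAR/"*` into
    `rm -rf /*`; here the same mistake raises instead.
    """
    if not raw or not raw.strip():
        raise ValueError("refusing to delete: path is empty")
    path = Path(raw).resolve()
    if path == Path(path.anchor):  # e.g. '/' on POSIX
        raise ValueError(f"refusing to delete filesystem root: {path}")
    return path  # now reasonably safe to hand to shutil.rmtree(path)

print(guarded_rmtree_target("/tmp/steam-build"))
```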
-
8. If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.
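scrapyd is driven over an HTTP JSON API; scheduling a spider is a form-encoded POST to its `schedule.json` endpoint. A stdlib sketch of building that request, where the host, project, and spider names are hypothetical:

```python
from urllib import parse, request

def schedule_request(host: str, project: str, spider: str) -> request.Request:
    """Build a POST to scrapyd's schedule.json endpoint (default port 6800)."""
    data = parse.urlencode({"project": project, "spider": spider}).encode()
    return request.Request(f"http://{host}:6800/schedule.json", data=data, method="POST")

req = schedule_request("localhost", "myproject", "myspider")
print(req.full_url, req.data)
# Sending it requires a running scrapyd: urllib.request.urlopen(req)
```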
-
11. With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using https://github.com/rmax/scrapy-redis.
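The core idea in scrapy-redis is a shared Redis-backed crawl frontier plus a shared dedup filter, so many workers pull from one queue. A sketch with in-memory structures standing in for Redis (the class is hypothetical, for illustration only):

```python
from collections import deque

class SharedFrontier:
    """In-memory stand-in for scrapy-redis's Redis-backed queue + dupefilter."""
    def __init__(self):
        self.queue = deque()  # a Redis list in the real setup
        self.seen = set()     # a Redis set in the real setup

    def push(self, url):
        if url not in self.seen:  # dedup across all workers
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        return self.queue.popleft() if self.queue else None

frontier = SharedFrontier()
for url in ["https://a.example/", "https://b.example/", "https://a.example/"]:
    frontier.push(url)  # the duplicate is filtered out

# Two "workers" draining the same frontier:
worker1, worker2 = frontier.pop(), frontier.pop()
print(worker1, worker2, frontier.pop())  # https://a.example/ https://b.example/ None
```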
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):
* Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.
* I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order-of-magnitude less expensive than running directly on their platform.
* Bright Data nee Luminati is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude more expensive if you need anything more thorough than data center proxies.
* For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.
* If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many many hours fighting anti-bot protection doing it myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.
Always looking for new techniques and tools, so I'll monitor this thread closely.
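The appeal of a fronting service like ScrapingBee is that a scrape collapses into one GET to their API carrying your key and the target URL. A stdlib sketch of building such a request; the endpoint and parameter names follow ScrapingBee's public docs but should be treated as assumptions and checked against the current documentation:

```python
from urllib import parse

def scrapingbee_url(api_key: str, target: str, render_js: bool = False) -> str:
    """Build a ScrapingBee-style API URL (param names are assumptions;
    verify against the provider's current docs before use)."""
    query = parse.urlencode({
        "api_key": api_key,
        "url": target,
        "render_js": str(render_js).lower(),
    })
    return f"https://app.scrapingbee.com/api/v1/?{query}"

url = scrapingbee_url("MY_KEY", "https://example.com/page?id=1")
print(url)
```

Note that `urlencode` handles escaping the target URL's own query string, which is the usual source of bugs when building these by hand.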
-
We've built https://serpapi.com
We invented what you're referring to as "data type specific APIs" in this industry: APIs that abstract away all proxy issues, CAPTCHA solving, support for various layouts, even scraping-related legal issues, and much more, into a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status
I think the next battle will still be a legal one, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this area, and we are proud to be a significant yearly contributor to the EFF.
-
Webscraping Open Project
Discontinued: The web scraping open project repository aims to share knowledge and experiences about web scraping with Python [Moved to: https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero]
I'm collecting my experience with these tools in this "web scraping open knowledge project" on GitHub (https://github.com/reanalytics-databoutique/webscraping-open...) and on my Substack (http://thewebscraping.club/) for longer free content
-
Could someone recommend a library for C# like one of these two (they are for Python): mlscraper and autoscraper?