You want a proper HTML5 parser that can handle invalid documents. The fastest one is https://github.com/kovidgoyal/html5-parser, which is over 30x faster than html5lib.
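For what it's worth, here is a rough sketch of pulling links out of a broken document with html5-parser. It assumes the package is installed (`pip install html5-parser`, which also needs lxml) and relies on its default lxml tree output. The `extract_links` helper, the `_LinkCollector` class, and the sample markup are made up for illustration, and the fallback to the standard library's `html.parser` is only there so the snippet still runs where html5-parser is missing.

```python
from html.parser import HTMLParser

try:
    # Third-party package: pip install html5-parser (requires lxml).
    from html5_parser import parse as html5_parse
except ImportError:
    # Assumption: fall back to the stdlib parser so the sketch still runs.
    html5_parse = None


class _LinkCollector(HTMLParser):
    """Stdlib fallback (hypothetical helper): collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the href of every <a> element, in document order."""
    if html5_parse is not None:
        # parse() returns the root of an lxml tree by default.
        root = html5_parse(html)
        return [a.get("href") for a in root.iter("a") if a.get("href") is not None]
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Invalid markup: unclosed <p> elements and an unquoted attribute value.
BROKEN_HTML = '<p>Unclosed paragraph <a href="/a">one</a><p>Second <a href=/b>two</a>'
print(extract_links(BROKEN_HTML))
```

Because the default tree is a regular lxml tree, the usual lxml tools such as `iter()` and XPath work on it once parsing is done.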
In those cases you might want to check out SeleniumBase: https://seleniumbase.io/
Playwright for Python has really good documentation: https://playwright.dev/python/
I used it for my https://shot-scraper.datasette.io/ tool, and wrote a bit about CLI-driven scraping using that tool here: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...
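To give a feel for the Playwright for Python API, here is a minimal sketch using its synchronous interface. It assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The function name `page_title_and_links` is hypothetical, and this is not how shot-scraper itself is implemented.

```python
def page_title_and_links(url):
    """Load a page in headless Chromium and return its title and link URLs."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            page.goto(url)
            title = page.title()
            # Runs in the browser, so hrefs come back resolved to absolute URLs.
            links = page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)"
            )
        finally:
            browser.close()
    return title, links
```

Calling `page_title_and_links("https://example.com/")` should return the rendered page's title and a list of its links. Since the page is loaded in a real browser, this also covers content that only appears after JavaScript has run.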
> Does anyone know if there is a good equivalent for Go?
Yes: https://github.com/anaskhan96/soup
It works well.