scrapy-playwright
ArchiveBox
 | scrapy-playwright | ArchiveBox |
---|---|---|
Mentions | 14 | 266 |
Stars | 1,077 | 22,970 |
Growth | 3.9% | 2.3% |
Activity | 7.9 | 9.9 |
Last commit | 2 months ago | 6 days ago |
Language | Python | Python |
License | BSD 3-clause "New" or "Revised" License | MIT |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
scrapy-playwright
- Current problems and mistakes of web scraping in Python and tricks to solve them!
Middleware libraries are written by the community and are extending their functionality. For example, scrapy-playwright.
- Announcing Crawlee Python: Now you can use Python to build reliable web crawlers
While libraries like Scrapy require installing additional middleware, i.e., scrapy-playwright, and still don't work on Windows, Crawlee for Python supports a unified interface for HTTP and headless browsers.
- Scrapy Vs. Crawlee
Scrapy does not support headless browsers natively, but its plugin system adds that support. Similarly, it cannot scrape JavaScript-rendered websites on its own, but the plugin system makes this possible. One of the best examples is its Playwright plugin.
- Web Scraping Dynamic Websites With Scrapy Playwright
scrapy-playwright is an integration between Scrapy and Playwright. It enables scraping dynamic web pages with Scrapy by processing the web scraping requests using a Playwright instance.
- Turning webpages into pdf
- Scrapy & splash guide
- Web scraping with Python
To integrate Playwright with Scrapy, we will use the scrapy-playwright library. Then, we will scrape https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ to demonstrate how to extract data from a website using Playwright and Scrapy.
- Which libraries/frameworks could be used for page interaction?
Scrapy-playwright
- Implementing a Selenium backend on a web app?
Your website is dynamic. There are many Scrapy integrations that can help; this is the best one: https://github.com/scrapy-plugins/scrapy-playwright
- Is Selenium still a good choice?
This concern should be lifted if you are a Scrapy lover. There is a Scrapy integration for Playwright that gives you a lot of freedom and lets you operate from a Scrapy spider.
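The integration described in the mentions above can be sketched as a Scrapy settings fragment plus a small request helper. This is a minimal sketch, assuming the download-handler path and the `playwright` meta key documented by scrapy-playwright; `playwright_request_kwargs` is a hypothetical helper written for illustration and is not part of either library, and the URL in the usage note is a placeholder.

```python
# settings.py fragment (sketch): hand HTTP(S) downloads to Playwright.
# The handler path and reactor name follow the scrapy-playwright README;
# verify them against the version you install.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright relies on the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def playwright_request_kwargs(url: str) -> dict:
    """Hypothetical helper: keyword arguments for scrapy.Request.

    Setting meta["playwright"] to True asks scrapy-playwright to fetch
    the page in a browser; requests without the key use Scrapy's
    default downloader.
    """
    return {"url": url, "meta": {"playwright": True}}
```

Inside a spider, the dict would typically be unpacked as `scrapy.Request(**playwright_request_kwargs("https://example.com"))`, so only the requests that need JavaScript rendering pay the cost of a browser.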
ArchiveBox
- Ask HN: How Do You Bookmark?
2. Drop the link into my instance of ArchiveBox [0] and will return to it a few weeks/months later or, more often than not, never again
[0] https://archivebox.io/
- Is stuff online worth saving?
I use https://github.com/gildas-lormeau/SingleFile
I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.
You might also have a look at https://github.com/ArchiveBox/ArchiveBox
- Ask HN: How to remove Ads from a downloaded HTML file to output an ad free file?
After taking a break and stepping away for a bit, I realized that I was recreating an archiving system for websites and that there are existing solutions that do the same thing.
I found https://github.com/ArchiveBox/ArchiveBox/, which is a self-hosted web archiving system. It covers most of my use cases (and I can extend it for additional functionality), so I am going to set this up and try it out.
Thanks all for the help.
- Internet Archive breached again through stolen access tokens
Is anyone using ArchiveBox regularly? It's a self-hosted archiving solution. Not the ambitious decentralized system I think this comment is thinking of but a practical way for someone to run an archive for themselves. https://archivebox.io/
- ArchiveBox is evolving: the future of self-hosted internet archives
- Tell HN: The Wayback Machine is up, in read-only mode
- Web Archiving Projects
- I have 2000 old VHS tapes in my garage and I don't know what to do with them
- To preserve their work journalists take archiving into their own hands
- Why are so many books listed as "Borrow Unavailable" at the Internet Archive?
And one nice tool for scraping archives for yourself is https://archivebox.io/
a nice frontend by https://news.ycombinator.com/user?id=nikisweeting
What are some alternatives?
scrapy-splash - Scrapy+Splash for JavaScript integration
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
scrapy-cloudflare-middleware - A Scrapy middleware to bypass the CloudFlare's anti-bot protection
ArchivesSpace - ArchivesSpace, the archives management tool
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
linkwarden - ⚡️⚡️⚡️Self-hosted collaborative bookmark manager to collect, organize, and preserve webpages, articles, and documents.
scrapy-fake-useragent - Random User-Agent middleware based on fake-useragent
Archivematica - Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.
scrapy-rotating-proxies - use multiple proxies with Scrapy
Access to Memory (AtoM) - Open-source, web application for archival description and public access.
aiopath - 📁 Asynchronous pathlib for Python
SingleFile - Web Extension for saving a faithful copy of a complete web page in a single HTML file