scrapy-playwright
ArchiveBox
Metric | scrapy-playwright | ArchiveBox |
---|---|---|
Mentions | 11 | 248 |
Stars | 837 | 19,861 |
Growth | 3.1% | 1.7% |
Activity | 7.8 | 9.8 |
Last commit | 3 months ago | 3 days ago |
Language | Python | Python |
License | BSD 3-clause "New" or "Revised" License | MIT |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
scrapy-playwright
-
Web Scraping Dynamic Websites With Scrapy Playwright
scrapy-playwright is an integration between Scrapy and Playwright. It enables scraping dynamic web pages with Scrapy by processing the web scraping requests using a Playwright instance.
- Turning webpages into pdf
- Scrapy & splash guide
-
Web scraping with Python
To integrate Playwright with Scrapy, we will use the scrapy-playwright library. Then, we will scrape https://www.mintmobile.com/product/google-pixel-7-pro-bundle/ to demonstrate how to extract data from a website using Playwright and Scrapy.
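As a rough illustration of that integration: scrapy-playwright renders a request in the browser when the request's `meta` dict carries `"playwright": True`. The sketch below only builds the request arguments, so it runs without Scrapy installed. The helper name `playwright_request_kwargs` is hypothetical, and the sketch assumes the project settings already register the scrapy-playwright download handler.

```python
# Hypothetical helper (not part of scrapy-playwright): builds the keyword
# arguments for a scrapy.Request that should be rendered by Playwright.
def playwright_request_kwargs(url: str) -> dict:
    # The "playwright" meta key is what scrapy-playwright checks to decide
    # whether a request goes through the browser.
    return {"url": url, "meta": {"playwright": True}}


PRODUCT_URL = "https://www.mintmobile.com/product/google-pixel-7-pro-bundle/"
kwargs = playwright_request_kwargs(PRODUCT_URL)

# Inside a spider, this would typically be used as:
#     yield scrapy.Request(**kwargs, callback=self.parse)
```

Requests without the `playwright` meta key keep going through Scrapy's regular downloader, so a spider can mix static and browser-rendered pages.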
-
which libraries/frameworks could be used for page interaction?
Scrapy-playwright
-
Implementing a Selenium backend on a web app?
If your website is dynamic, there are many Scrapy integrations that can help. This is the best one: https://github.com/scrapy-plugins/scrapy-playwright
-
Is Selenium still a good choice?
This concern should be lifted if you are a Scrapy lover. There is a Scrapy integration for Playwright that gives you a lot of freedom and lets you operate from a Scrapy spider.
-
Scraping Dynamic Javascript Websites with Scrapy and Scrapy-playwright
Now we need to modify Scrapy's settings to allow it to work with Playwright. Instructions can be found on scrapy-playwright's GitHub page. We need to add settings for DOWNLOAD_HANDLERS and TWISTED_REACTOR. The newly added settings are marked between ### lines. This is what the settings file should look like:
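The excerpt's own settings file is not reproduced here. The following is a minimal sketch of just the two settings it names, with values as documented in the scrapy-playwright README. The rest of a real `settings.py` (bot name, spider modules, and so on) is omitted, and the ### markers follow the convention the excerpt describes.

```python
# settings.py (fragment): only the scrapy-playwright additions are shown.

###
# Route both HTTP and HTTPS downloads through the Playwright handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
###
```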
-
Web Scraping with Python: Everything you need to know
You can use something like scrapy-playwright[0] to run a headless browser framework as your download handler. I think there are versions for some of the other headless systems, if you prefer those.
[0] https://github.com/scrapy-plugins/scrapy-playwright
-
Make an addition to scrapy_playwright source code
[1]: https://github.com/scrapy-plugins/scrapy-playwright/issues/61
ArchiveBox
-
Ask HN: What Underrated Open Source Project Deserves More Recognition?
Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
- YaCy, a distributed Web Search Engine, based on a peer-to-peer network
-
Vice website is shutting down
If you really want to save the content for yourself, use something like https://archivebox.io/
I've been running a local instance for a few years now and download/save tech articles all the time. I can search and find them as needed.
-
An Introduction to the WARC File
API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496
I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.
-
Ask HN: How can I back up an old vBulletin forum without admin access?
I guess your best chance is to use something like https://archivebox.io/.
-
ArchiveBox – open-source self-hosted web archiving
Yeah this is a cool project but it was discussed 2 days ago.
As mentioned by the maintainer there, they even maintain a list of alternatives, very classy:
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
- ArchiveBox: Open-source self-hosted web archiving
- Linkhut: A Social Bookmarking Site
- Show HN: Rem: Remember Everything (open source)
- Bookmark manager with a focus on organization?
What are some alternatives?
scrapy-splash - Scrapy+Splash for JavaScript integration
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
scrapy-cloudflare-middleware - A Scrapy middleware to bypass CloudFlare's anti-bot protection
paimon-moe - Your best Genshin Impact companion! Help you plan what to farm with ascension calculator and database. Also track your progress with todo and wish counter.
Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
SingleFile - Web Extension for saving a faithful copy of a complete web page in a single HTML file
scrapy-rotating-proxies - use multiple proxies with Scrapy
ArchivesSpace - The ArchivesSpace archives management tool
scrapy-fake-useragent - Random User-Agent middleware based on fake-useragent
grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
aiopath - 📁 Asynchronous pathlib for Python
Archivematica - Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.