pywb
ArchiveBox
| pywb | ArchiveBox |
---|---|---|
Mentions | 7 | 248 |
Stars | 1,303 | 19,861 |
Growth | 1.2% | 1.7% |
Activity | 7.2 | 9.8 |
Last commit | 15 days ago | 6 days ago |
Language | JavaScript | Python |
License | GNU General Public License v3.0 only | MIT |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
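To make the Growth figure concrete, here is a rough sketch that converts a month-over-month percentage back into an approximate number of new stars. It assumes growth is computed as the relative change against the previous month's star count; the page does not state the exact formula, and the function name is made up for illustration.

```python
def stars_gained_last_month(stars_now, growth_percent):
    """Estimate stars added over the last month.

    Assumption: growth_percent = (stars_now - stars_prev) / stars_prev * 100.
    """
    stars_prev = stars_now / (1 + growth_percent / 100)
    return stars_now - stars_prev


# Star counts and growth rates taken from the comparison table.
for name, stars, growth in [("pywb", 1303, 1.2), ("ArchiveBox", 19861, 1.7)]:
    gained = stars_gained_last_month(stars, growth)
    print(f"{name}: about {gained:.0f} stars gained in the last month")
```

Under that assumption, the two projects gained roughly 15 and 332 stars respectively, so a similar percentage can hide a large difference in absolute numbers.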
pywb
-
Is there any good software for deduping (deduplicating) content in WARC files?
I have thousands of bookmarks on raindrop.io that I've been wanting to archive for a while. However, I've archived ~150 pages so far with Pywb and it ended up being 500MB across two WARCs, even with the dedupe setting specified in my settings file. It dedupes while archiving pages. I want software to get any spots missed and be sure that WARCs are actually deduped.
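One way to check whether a WARC is really deduplicated is to group response records by their payload digest and look for digests that occur more than once. The following is a minimal sketch, not a full deduplication tool: it only reports duplicates and does not rewrite the WARC. It assumes the warcio library is installed and that the records carry a `WARC-Payload-Digest` header; the helper names are hypothetical.

```python
from collections import defaultdict


def find_duplicate_payloads(records):
    """Return {digest: [(uri, size), ...]} for digests seen more than once.

    `records` is any iterable of (uri, digest, size_in_bytes) tuples.
    """
    by_digest = defaultdict(list)
    for uri, digest, size in records:
        if digest:  # records without a digest cannot be compared
            by_digest[digest].append((uri, size))
    return {d: hits for d, hits in by_digest.items() if len(hits) > 1}


def wasted_bytes(duplicates):
    """Rough upper bound on bytes taken by repeats after the first copy.

    Sizes are uncompressed record lengths, so the on-disk saving in a
    .warc.gz file will usually be smaller.
    """
    return sum(size for hits in duplicates.values() for _, size in hits[1:])


def iter_response_records(warc_path):
    """Yield (uri, digest, size) for each response record in a WARC file.

    warcio is imported lazily so the helpers above work without it.
    """
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # revisit records are already deduplicated
            headers = record.rec_headers
            yield (
                headers.get_header("WARC-Target-URI"),
                headers.get_header("WARC-Payload-Digest"),
                int(headers.get_header("Content-Length") or 0),
            )
```

For example, `find_duplicate_payloads(iter_response_records("example.warc.gz"))` (the file name is a placeholder) lists every payload stored more than once as a full response, and `wasted_bytes` on that result gives a rough idea of how much space the missed spots take up.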
-
Is there a way to easily and reliably SSH to my laptop no matter what wifi the laptop is connected to? I have no clue.
I don't know if the solution would be related or relevant to this, but I would also want to be able to remotely launch a web server, Pywb, and access it in Safari on my iPad, also no matter what wifi I'm on. On a Mac, it would be launched with the command wayback and the server would be accessed in the browser at localhost:8080.
-
I can't install a Python package, pywb, looks like a problem with brotlipy. What can I do?
Check their GitHub site. I would try "git clone https://github.com/webrecorder/pywb"
-
Purevolume archives?
I've been trying to open those large warc files these days. I've tried webrecorder, replayweb, pywb and warcat before but none of these worked well for me.
-
Ran grab-site now have some warc.gz files etc, the site in question was originally hosted in a mixture of html and javascript, what's the best and easiest way to make this accessible as a user for offline personal use?
pywb, but it requires creating a full copy of the data: https://github.com/webrecorder/pywb/issues/408
-
How good is ArchiveWeb.page?
I found it to be good at loading small WARCs quickly, but it can take longer if the WARC is larger. Webarchive player, while it's old and discontinued, I've found works better than Webrecorder Player and replayweb.page. If you want newer software to replay WARCs, try Pywb. I find it to be the best WARC player.
-
Saving all browsed websites automatically
I use pywb in proxy recording mode.
ArchiveBox
-
Ask HN: What Underrated Open Source Project Deserves More Recognition?
Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
- YaCy, a distributed Web Search Engine, based on a peer-to-peer network
-
Vice website is shutting down
If you really want to save the content for yourself, use something like https://archivebox.io/
I've been running a local instance for a few years now and download/save tech articles all the time. I can search and find them as needed.
-
An Introduction to the WARC File
API is coming soon (relatively, it's still a one-man project)! Stay tuned https://github.com/ArchiveBox/ArchiveBox/issues/496
I have an event-sourcing refactor in progress now to allow us to pluginize functionality like the API (similar to Home Assistant with a plugin app store); it will take a month or two. Next up is the REST API using the new plugin system.
-
Ask HN: How can I back up an old vBulletin forum without admin access?
I guess your best chance is to use something like https://archivebox.io/.
-
ArchiveBox – open-source self-hosted web archiving
Yeah this is a cool project but it was discussed 2 days ago.
As mentioned by the maintainer there, they even maintain a list of alternatives, very classy:
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
- ArchiveBox: Open-source self-hosted web archiving
- Linkhut: A Social Bookmarking Site
- Show HN: Rem: Remember Everything (open source)
- Bookmark manager with a focus on organization?
What are some alternatives?
conifer - Collect and revisit web pages.
Wallabag - wallabag is a self hostable application for saving web pages: Save and classify articles. Read them later. Freely.
warcio - Streaming WARC/ARC library for fast web archive IO
paimon-moe - Your best Genshin Impact companion! Help you plan what to farm with ascension calculator and database. Also track your progress with todo and wish counter.
awesome-selfhosted - A list of Free Software network services and web applications which can be hosted on your own servers
SingleFile - Web Extension for saving a faithful copy of a complete web page in a single HTML file
replayweb.page - Serverless replay of web archives directly in the browser
ArchivesSpace - The ArchivesSpace archives management tool
SingleFileZ - Web Extension to save a faithful copy of an entire web page in a self-extracting ZIP file
grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
22120 - 💾 Diskernet - Your preferred backup solution. It's like you're still online! Full text search archive from your browsing and bookmarks. Weclome! to the Diskernet: an internet on yer disk. Disconnect with Diskernet, an internet for the post-online apocalypse. Or the airplane WiFi. Or the site goes down. Or ... You get the picture. Get Diskernet. 80s logo. Formerly 22120 (project codename) ;P ;) xx;p [Moved to: https://github.com/i5ik/Diskernet]
Archivematica - Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.