Looking for open source software to scrape webpages but also make them searchable with a web UI (locally hosted)
4 projects | reddit.com/r/DataHoarder | 16 Jan 2021
I created Collect a few years ago and still use it today.
How are you archiving websites you visit?
6 projects | reddit.com/r/selfhosted | 7 Jan 2023
After a lot of searching for a similar topic, this is a tool I found which works pretty well: https://github.com/ArchiveTeam/grab-site
Help building or mirroring docs.microsoft.com
2 projects | reddit.com/r/DataHoarder | 9 Aug 2022
Crawling is of course the other option. I've seen https://github.com/ArchiveTeam/grab-site in the wiki, but I'm unsure how to host the resulting .warc archives.
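One low-effort way to make the resulting `.warc` archives browsable, without running a dedicated replay server, is to open them in replayweb.page, which replays archives entirely in the browser. A sketch (the directory and file names below are hypothetical):

```shell
# grab-site leaves one directory per crawl job, named after the URL:
ls docs.microsoft.com-2022-08-09-abc123/
# The .warc.gz files inside can be loaded directly into
# https://replayweb.page via its file picker -- no hosting needed.
```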
How to mirror multiple websites correctly?
2 projects | reddit.com/r/DataHoarder | 12 May 2022
It's a completely different tool, but I like using grab-site https://github.com/archiveteam/grab-site . Try --wpull-args=--span-hosts='' or something to make it mirror all subdomains. It outputs in WARC format which can be read with a site like https://replayweb.page.
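As a hedged sketch of that suggestion (example.com is a placeholder, and the exact wpull argument may need tuning for your site):

```shell
# Allow the crawler to follow links onto other hosts such as subdomains;
# --wpull-args passes the argument straight through to the underlying wpull.
grab-site 'https://example.com/' --wpull-args='--span-hosts'
```

The resulting `.warc.gz` files can then be dropped into https://replayweb.page for browsing.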
Stack Overflow Developer Story Data Dump (10 whole MB !)
2 projects | reddit.com/r/DataHoarder | 1 Apr 2022
Thus, as a bit of a statement, here's your "I will do it myself even if I have to bash my head against the wall" collection of the Developer Story for 10-20 top users. I know there are some blogs on old web design; perhaps it might be worth their while as a memento of a bygone era. As for myself, I am looking into setting up a dedicated server for either grab-site or ArchiveBox. Possibly both!
Need Local Website Archiver Recommendation
2 projects | reddit.com/r/DataHoarder | 1 Mar 2022
https://github.com/ArchiveTeam/grab-site is easy to use and records in the WARC container format.
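A minimal run looks like the following sketch (assuming grab-site is installed, e.g. via `pip install grab-site`; the URL is a placeholder):

```shell
# Start the dashboard in one terminal (serves http://127.0.0.1:29000 by default)
gs-server

# In another terminal, start a crawl; grab-site writes its output to a
# directory named after the URL, e.g. ./example.com-<timestamp>-<id>/
grab-site 'https://example.com/'
```

The output directory contains the `.warc.gz` archives plus logs, and the crawl can be watched and adjusted (e.g. ignore patterns) live from the dashboard.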
How to scrape an entire website/all of its content?
2 projects | reddit.com/r/Piracy | 7 Feb 2022
Take a look at grab-site by ArchiveTeam; it's a very powerful tool for mirroring websites.
How to archive a website that's shutting down soon
2 projects | news.ycombinator.com | 9 Jan 2022
I have a list of reddit posts I want to save on my hard drive. What's the easiest way?
2 projects | reddit.com/r/DataHoarder | 6 Jan 2022
Try using a tool such as grab-site. https://github.com/archiveteam/grab-site
How to save/copy/archive a website that is going to be closed down?
2 projects | reddit.com/r/DataHoarder | 29 Aug 2021
Thanks for pointing that out! You led me to https://github.com/ArchiveTeam/grab-site which makes it so easy to grab a site by myself.
How to download simple wikipedia
2 projects | reddit.com/r/selfhosted | 17 Jul 2021
Oh, I would definitely recommend using grab-site (to download the site) and then ReplayWeb.page (the application, not the website!) to access it offline; it's almost as if you still had a working Internet connection.
What are some alternatives?
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
browsertrix-crawler - Run a high-fidelity browser-based crawler in a single Docker container
docker-swag - Nginx webserver and reverse proxy with php support and a built-in Certbot (Let's Encrypt) client. It also contains fail2ban for intrusion prevention.
awesome-datahoarding - List of data-hoarding related tools
wpull - Wget-compatible web downloader and crawler.
replayweb.page - Serverless Web Archive Replay directly in the browser
win32 - Public mirror for win32-pr
bitextor - Bitextor generates translation memories from multilingual websites
collect - ODK Collect is an Android app for filling out forms. It's been used to collect billions of data points in challenging environments around the world. Contribute and make the world a better place! ✨📋✨