Looking for open source software to scrape webpages but also make them searchable with a web UI (locally hosted)
4 projects | reddit.com/r/DataHoarder | 16 Jan 2021
I created Collect a few years ago and still use it today.
How are you archiving websites you visit?
6 projects | reddit.com/r/selfhosted | 7 Jan 2023
After a lot of searching for a similar topic, this is a tool I found which works pretty well: https://github.com/ArchiveTeam/grab-site
Help building or mirroring docs.microsoft.com
2 projects | reddit.com/r/DataHoarder | 9 Aug 2022
Crawling is of course the other option. I've seen https://github.com/ArchiveTeam/grab-site in the wiki, but I'm unsure how to host the resulting .warc archives.
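One low-effort way to make the resulting `.warc` archives browsable, without running a dedicated replay server, is to open them in replayweb.page, which replays archives entirely in the browser. A sketch (the directory and file names below are hypothetical):

```shell
# grab-site leaves one directory per crawl job, named after the URL:
ls docs.microsoft.com-2022-08-09-abc123/
# The .warc.gz files inside can be loaded directly into
# https://replayweb.page via its file picker -- no hosting needed.
```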
How to mirror multiple websites correctly?
2 projects | reddit.com/r/DataHoarder | 12 May 2022
It's a completely different tool, but I like using grab-site https://github.com/archiveteam/grab-site . Try --wpull-args=--span-hosts='' or something to make it mirror all subdomains. It outputs in WARC format which can be read with a site like https://replayweb.page.
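As a hedged sketch of that suggestion (example.com is a placeholder, and the exact wpull argument may need tuning for your site):

```shell
# Allow the crawler to follow links onto other hosts such as subdomains;
# --wpull-args passes the argument straight through to the underlying wpull.
grab-site 'https://example.com/' --wpull-args='--span-hosts'
```

The resulting `.warc.gz` files can then be dropped into https://replayweb.page for browsing.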
Stack Overflow Developer Story Data Dump (10 whole MB !)
2 projects | reddit.com/r/DataHoarder | 1 Apr 2022
Thus, as a bit of a statement, here's your "I will do it myself even if I have to bash my head against the wall" collection of the Developer Story for 10-20 top users. I know there are some blogs on old web design; perhaps it might be worth their while as a memento of a bygone era. As for myself, I am looking into setting up a dedicated server for either grab-site or ArchiveBox. Possibly both!
Need Local Website Archiver Recommendation
2 projects | reddit.com/r/DataHoarder | 1 Mar 2022
https://github.com/ArchiveTeam/grab-site is easy to use and records in the WARC container format.
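A minimal run looks like the following sketch (assuming grab-site is installed, e.g. via `pip install grab-site`; the URL is a placeholder):

```shell
# Start the dashboard in one terminal (serves http://127.0.0.1:29000 by default)
gs-server

# In another terminal, start a crawl; grab-site writes its output to a
# directory named after the URL, e.g. ./example.com-<timestamp>-<id>/
grab-site 'https://example.com/'
```

The output directory contains the `.warc.gz` archives plus logs, and the crawl can be watched and adjusted (e.g. ignore patterns) live from the dashboard.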
How to scrape an entire website/all of its content?
2 projects | reddit.com/r/Piracy | 7 Feb 2022
Take a look at grab-site by ArchiveTeam; it's a very powerful tool for mirroring websites.
How to archive a website that's shutting down soon
2 projects | news.ycombinator.com | 9 Jan 2022
I have a list of reddit posts I want to save on my hard drive. What's the easiest way?
2 projects | reddit.com/r/DataHoarder | 6 Jan 2022
Try using a tool such as grab-site. https://github.com/archiveteam/grab-site
How to save/copy/archive a website that is going to be closed down?
2 projects | reddit.com/r/DataHoarder | 29 Aug 2021
Thanks for pointing that out! You led me to https://github.com/ArchiveTeam/grab-site which makes it so easy to grab a site by myself.
How to download simple wikipedia
2 projects | reddit.com/r/selfhosted | 17 Jul 2021
Oh, I would definitely recommend using grab-site (to download the site) and then ReplayWeb.page (the application, not the website!) to access it offline; it's almost as if you still had a working Internet connection.
What are some alternatives?
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
browsertrix-crawler - Run a high-fidelity browser-based crawler in a single Docker container
docker-swag - Nginx webserver and reverse proxy with php support and a built-in Certbot (Let's Encrypt) client. It also contains fail2ban for intrusion prevention.
awesome-datahoarding - List of data-hoarding related tools
wpull - Wget-compatible web downloader and crawler.
replayweb.page - Serverless Web Archive Replay directly in the browser
win32 - Public mirror for win32-pr
bitextor - Bitextor generates translation memories from multilingual websites
collect - ODK Collect is an Android app for filling out forms. It's been used to collect billions of data points in challenging environments around the world. Contribute and make the world a better place! ✨📋✨