How to Download All of Wikipedia onto a USB Flash Drive

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • zim-tools

    Various ZIM command line tools

  • It looks like Kiwix uses the ZIM file format, which appears to have diffing support [0] (see zimdiff and zimpatch). That said, it doesn't look like Kiwix actually publishes those diffs.

    [0] https://github.com/openzim/zim-tools/tree/master/src
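
The diff workflow mentioned above could look roughly like the sketch below. It only builds the command lines and does not run them. The positional argument order (old file, new file, diff output for `zimdiff`; original file, diff, patched output for `zimpatch`) is an assumption and should be checked against each tool's `--help`, and the file names are hypothetical.

```python
import shlex


def zim_diff_commands(old_zim, new_zim, diff_zim, patched_zim):
    """Return (diff_cmd, patch_cmd) as argument lists; nothing is executed.

    Assumed usage (verify with --help before relying on it):
      zimdiff  <old> <new> <diff_out>
      zimpatch <old> <diff> <patched_out>
    """
    diff_cmd = ["zimdiff", old_zim, new_zim, diff_zim]
    patch_cmd = ["zimpatch", old_zim, diff_zim, patched_zim]
    return diff_cmd, patch_cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    commands = zim_diff_commands(
        "wikipedia_old.zim", "wikipedia_new.zim",
        "wikipedia.diff.zim", "wikipedia_patched.zim",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Since Kiwix does not appear to publish such diffs, both full ZIM files would have to be available locally to produce one.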

  • wiktextract

    Wiktionary dump file parser and multilingual data extractor

  • CDPedia

    CDPedia is a project to make Wikipedia accessible offline

  • awesome-web-archiving

    An Awesome List for getting started with web archiving

  • Not related to the OP topic or zim but I was looking into archiving my bookmarks and other content like documentation sites and wikis. I'll list some of the things I ended up using.

    ArchiveBox[1]: Pretty much a self-hosted Wayback Machine. It can save websites as plain HTML, screenshots, text, and some other formats. I have my bookmarks archived in it and use a bookmarklet to easily add new websites. If you use the docker-compose setup, you can enable a full-text search backend for easy searching.

    WebRecorder[2]: A browser extension that creates WACZ archives directly in the browser, capturing exactly the content you load. I use it on sites with annoying dynamic content that services like the Wayback Machine and ArchiveBox wouldn't be able to copy.

    ReplayWeb[3]: An interface to browse archive types like WARC, WACZ, and HAR. The interface is just like browsing through your browser. It can be self-hosted as well for the full offline experience.

    browsertrix-crawler[4]: A CLI tool to scrape websites and output to WACZ. It's super easy to run with Docker, and I use it to scrape entire blogs and docs for offline use. It uses Chrome to load webpages and has some extra features like custom browser profiles, interactive login, and autoscroll/autoplay. I use the `--generateWACZ` parameter so I can use ReplayWeb to easily browse through the final output.

    For bookmark and miscellaneous webpage archiving, ArchiveBox should be more than enough. Check out this repo for an amazing list of tools and resources: https://github.com/iipc/awesome-web-archiving

    [1] https://github.com/ArchiveBox/ArchiveBox
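
As a rough illustration of the browsertrix-crawler workflow described above, the sketch below assembles a Docker command line without executing it. Only the `--generateWACZ` flag comes from the comment. The image name `webrecorder/browsertrix-crawler`, the `crawl` subcommand, the `--url` and `--collection` options, and the `/crawls/` mount point are assumptions based on the project's documented usage and should be checked against its README. The URL and collection name are placeholders.

```python
import os
import shlex


def browsertrix_command(url, collection, out_dir="crawls"):
    """Build (but do not run) a docker command for a WACZ crawl.

    Everything except --generateWACZ is an assumption about the CLI;
    confirm it against the browsertrix-crawler README.
    """
    return [
        "docker", "run", "--rm",
        # Mount a local directory so the crawl output survives the container.
        "-v", f"{os.path.abspath(out_dir)}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--generateWACZ",
        "--collection", collection,
    ]


if __name__ == "__main__":
    # Placeholder URL and collection name, for illustration only.
    cmd = browsertrix_command("https://example.com/docs/", "example-docs")
    print(shlex.join(cmd))
```

The printed command can be pasted into a shell once the options have been verified. The resulting WACZ file is what a viewer such as ReplayWeb would open.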

  • replayweb.page

    Serverless replay of web archives directly in the browser

  • browsertrix-crawler

    Run a high-fidelity browser-based crawler in a single Docker container

  • ZIMply

    An easy to use offline reader for ZIM files right in your browser!

  • I think there are better ways to open ZIM files. I've had massive trouble with Kiwix. The old version seems broken beyond repair and the new version is too heavy.

    ZIMply on branch `version2` has worked pretty well for me [1]. The search works a lot better and it's really nicely formatted.

    [1] https://github.com/kimbauters/ZIMply/tree/version2
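
For readers curious what any ZIM reader has to do first, here is a minimal sketch that parses the start of a ZIM header. It assumes the layout given in the openZIM format description: little-endian fields starting with a 4-byte magic number (72173914), then major and minor version, a 16-byte UUID, and the entry and cluster counts. It is not how ZIMply or Kiwix are implemented, and the function and key names are made up for this example.

```python
import struct
import uuid

ZIM_MAGIC = 72173914  # 0x044D495A, per the openZIM format description

# magic (u32), major (u16), minor (u16), uuid (16 bytes),
# entry count (u32), cluster count (u32); all little-endian.
_HEADER = struct.Struct("<IHH16sII")


def read_zim_header(data: bytes) -> dict:
    """Parse the first 32 bytes of a ZIM file (assumed layout, see above)."""
    if len(data) < _HEADER.size:
        raise ValueError("need at least %d bytes" % _HEADER.size)
    magic, major, minor, raw_uuid, entries, clusters = _HEADER.unpack_from(data)
    if magic != ZIM_MAGIC:
        raise ValueError("not a ZIM file (bad magic number)")
    return {
        "major_version": major,
        "minor_version": minor,
        "uuid": uuid.UUID(bytes=raw_uuid),
        "entry_count": entries,
        "cluster_count": clusters,
    }
```

Usage would be along the lines of `read_zim_header(open("wikipedia.zim", "rb").read(32))`, where `wikipedia.zim` is a hypothetical file name.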

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
