Separate dump files for the top 20k subreddits

This page summarizes the projects mentioned and recommended in the original post on /r/pushshift

  • 7-Zip-zstd

    7-Zip with support for Brotli, Fast-LZMA2, Lizard, LZ4, LZ5 and Zstandard

  • You can extract the files yourself with 7-Zip. Install 7-Zip from here and then add this plugin to extract Zstandard files, or install the modified 7-Zip build from the plugin page, which already includes the plugin. Then open the .zst file you downloaded in 7-Zip and extract it.

  • qBittorrent

    qBittorrent BitTorrent client

  • This is a torrent. If you are not familiar with them, torrents are a way to share large files like these without paying hundreds of dollars in server hosting costs. They are peer-to-peer, which means that as you download, you are also uploading the files to other people. You can't just click a download button in your browser; you have to install a type of program called a torrent client. There are many torrent clients, but I recommend a simple, open-source one called qBittorrent.

  • PushshiftDumps

    Example scripts for the pushshift dump files

  • As an alternative, if you want to save the data in a different format or extract only the lines matching specific filters (keyword searches, date ranges, etc.), you can use a Python script like the examples I have here. This lets you iterate through each comment or submission in the file without having to extract the whole thing.
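
The streaming approach described for the dump scripts can be sketched roughly as follows. This is a minimal illustration, not the code from the PushshiftDumps repository. It assumes each dump is a Zstandard-compressed file with one JSON object per line, that records carry `body` and `created_utc` fields, and that the third-party `zstandard` package is installed. The function names `iter_zst_lines` and `filter_records` are hypothetical.

```python
import io
import json
from datetime import datetime, timezone


def iter_zst_lines(path):
    """Yield decoded text lines from a .zst file without extracting it to disk."""
    # Third-party dependency (pip install zstandard), imported lazily so the
    # filtering helper below also works on plain text files.
    import zstandard

    with open(path, "rb") as fh:
        # Assumption: the dumps were compressed with a large window, so the
        # decompressor is allowed a window of up to 2 GiB.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8", errors="replace"):
            yield line


def filter_records(lines, keyword=None, after=None, before=None, text_field="body"):
    """Yield parsed records that contain `keyword` (case-insensitive) and whose
    created_utc lies in [after, before). Bounds are timezone-aware datetimes."""
    needle = keyword.lower() if keyword else None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if not isinstance(record, dict):
            continue
        # created_utc may be stored as a number or as a numeric string.
        stamp = float(record.get("created_utc") or 0)
        created = datetime.fromtimestamp(stamp, tz=timezone.utc)
        if after is not None and created < after:
            continue
        if before is not None and created >= before:
            continue
        if needle and needle not in str(record.get(text_field, "")).lower():
            continue
        yield record
```

To use it, pass `iter_zst_lines("some_dump.zst")` (a placeholder file name) as the `lines` argument and write the yielded records to whatever output format you need. Because both functions are generators, only one line is held in memory at a time.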

  • Sketchpad

  • In addition to the dump files, Pushshift offers an API with powerful filtering options. The main limitation is that downloading a substantial amount of data takes quite some time. If you have a use case that doesn't align cleanly with specific subreddits, take a look at my API download script here. Again, I'm happy to work with you to build something for a specific use case.
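
Downloading through the API mostly comes down to paging through results one time window at a time, which is also why large downloads are slow. The sketch below is an illustration of that idea, not the download script linked above. It assumes the comment search endpoint at `api.pushshift.io`, the `q`, `after`, `before` and `size` parameters with epoch-second bounds, a response of the form `{"data": [...]}`, and results returned newest first. API access rules may have changed, and the names `download_comments` and `fetch_json` are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: the public comment search endpoint; availability may differ today.
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def fetch_json(url):
    """Default fetcher: GET the URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def download_comments(query, after, before, page_size=100, fetch=fetch_json, pause=1.0):
    """Yield comments matching `query` with after < created_utc < before.

    `after` and `before` are epoch seconds. The window is walked backwards
    from `before`, one page per request.
    """
    while True:
        params = {"q": query, "after": after, "before": before, "size": page_size}
        page = fetch(BASE_URL + "?" + urllib.parse.urlencode(params)).get("data", [])
        if not page:
            return
        yield from page
        # Move the upper bound to the oldest comment seen so far. Comments that
        # share that exact second can be missed; a real script should handle that.
        oldest = min(int(item["created_utc"]) for item in page)
        if oldest >= before:
            return  # no progress, stop rather than loop forever
        before = oldest
        time.sleep(pause)  # stay well under any rate limit
```

The `fetch` argument is there so the paging logic can be exercised without network access by passing a stand-in function that returns canned pages.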
