PushshiftDumps
Sketchpad
| | PushshiftDumps | Sketchpad |
|---|---|---|
| | 40 | 42 |
| Stars | 240 | 112 |
| Growth | - | - |
| Activity | 8.1 | 3.0 |
| | 8 days ago | 7 months ago |
| Language | Python | Python |
| License | MIT License | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
PushshiftDumps
-
Pushshift Dumps Help: Only getting submissions, that are named comments
I am trying to get comments and submissions from specific subreddits. So far, I've run the u/watchful1 script combine_folder_multiprocess.py and have been able to process a few files.
-
Create and Search In Your Own Reddit Database
FYI, you can use my filter_file.py script to directly extract out submissions with a certain title. There's a place you can put in a file with a list of keywords to filter on if you have a lot of them. Or it would be fairly easy to modify to use a regex. There are also steps listed to export the list of submission ids and then filter a comments file to only comments from those submissions. You can also export directly to CSV, though you would want to use zst files for any intermediate steps. Let me know if anything in there doesn't work.
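The steps described above (filter submissions by title keyword, export their ids, keep only the comments on those submissions, then write CSV) can be sketched roughly as follows. This is a minimal illustration, not the filter_file.py script itself: the function names are hypothetical, the input is assumed to be already-decompressed JSON lines, and comments are assumed to reference their submission through a `link_id` field of the form `t3_<submission id>`.

```python
import csv
import io
import json
import re


def filter_submissions(lines, keywords):
    """Yield submission dicts whose title contains any keyword (case-insensitive)."""
    # Keywords are escaped and joined into one regex; swap in a custom pattern if needed.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        if pattern.search(obj.get("title") or ""):
            yield obj


def filter_comments(lines, submission_ids):
    """Yield comment dicts that belong to one of the given submission ids."""
    # Assumption: link_id is the submission id prefixed with "t3_".
    wanted = {"t3_" + sid for sid in submission_ids}
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        if obj.get("link_id") in wanted:
            yield obj


def to_csv(rows, fields):
    """Render the selected fields of each row as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Tiny in-memory stand-ins for a submissions file and a comments file.
submissions = [
    '{"id": "abc", "title": "Python tips for beginners"}',
    '{"id": "def", "title": "Weekly discussion"}',
]
comments = [
    '{"id": "c1", "link_id": "t3_abc", "body": "Thanks!"}',
    '{"id": "c2", "link_id": "t3_def", "body": "Hello"}',
]

matched = list(filter_submissions(submissions, ["python"]))
ids = [s["id"] for s in matched]
kept = list(filter_comments(comments, ids))
print(to_csv(kept, ["id", "link_id", "body"]))
```

For real dumps the same functions could be fed line by line from a streaming decompressor, so nothing large has to sit in memory.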
-
Reddit starting to bring back deleted comments.
This repo has good examples of scripts to use them, https://github.com/Watchful1/PushshiftDumps
-
Encountered a non-utf8 character
```python
import json
from typing import Iterator

from zstandard import ZstdDecompressor


def read_redditfile(file: str) -> Iterator[dict]:
    """
    Iterate over the pushshift JSON lines, yielding them as Python dicts.
    Decompress iteratively if necessary.
    """
    # older files in the dataset are uncompressed, while newer ones are compressed and have .xz, .bz2, or .zst endings
    if not file.endswith(('.bz2', '.xz', '.zst')):
        with open(file, 'r', encoding='utf-8') as infile:
            for line in infile:
                yield json.loads(line)
    else:
        # code by Watchful1 written for the Pushshift offline dataset, found here: https://github.com/Watchful1/PushshiftDumps
        # note: this branch only decodes zstd data, even though .xz and .bz2 files are routed here too
        with open(file, 'rb') as fh:
            dctx = ZstdDecompressor(max_window_size=2147483648)
            with dctx.stream_reader(fh) as reader:
                previous_line = b""
                while True:
                    chunk = reader.read(2**24)  # 16mb chunks
                    if not chunk:
                        break
                    # split on raw bytes so a multi-byte UTF-8 character cut at a
                    # chunk boundary is never decoded in two halves
                    lines = (previous_line + chunk).split(b"\n")
                    for line in lines[:-1]:
                        yield json.loads(line)
                    previous_line = lines[-1]
```
-
What to do after decompressing the files from academic torrents?
Just look one folder down in the GitHub repo, https://github.com/Watchful1/PushshiftDumps/tree/master/scripts; the scripts are still there.
-
What are you using to browse/self host downloaded reddit?
I am working with the ZST files downloaded from Pushshift and sorted into subreddits by the lovely u/watchful1 here. ZST is too compressed to browse on its own, but using scripts like this one you can process them into readable NDJSON files. From there I'm not sure what to do. I would like to have a self-hosted Reddit clone that I can import these dumps into and browse freely.
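One possible next step, short of a full Reddit clone, is to load the NDJSON lines into a local SQLite database and query it. The sketch below is only an illustration under assumptions: the table layout and function names are made up here, and the field names (`id`, `subreddit`, `author`, `created_utc`, `body`) are assumed to be present in the comment objects.

```python
import json
import sqlite3


def _rows(lines):
    """Turn NDJSON lines into tuples matching the comments table."""
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        yield (
            obj.get("id"),
            obj.get("subreddit"),
            obj.get("author"),
            obj.get("created_utc"),
            obj.get("body"),
        )


def load_comments(conn, lines):
    """Insert comments into SQLite and return the total number stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        "id TEXT PRIMARY KEY, subreddit TEXT, author TEXT, "
        "created_utc INTEGER, body TEXT)"
    )
    # executemany consumes the generator lazily, so large files are streamed.
    conn.executemany("INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?)", _rows(lines))
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]


def search(conn, term):
    """Return (author, body) pairs whose body contains the term, oldest first."""
    cur = conn.execute(
        "SELECT author, body FROM comments WHERE body LIKE ? ORDER BY created_utc",
        ("%" + term + "%",),
    )
    return cur.fetchall()


sample = [
    '{"id": "c1", "subreddit": "test", "author": "alice", "created_utc": 100, "body": "zst files are small"}',
    '{"id": "c2", "subreddit": "test", "author": "bob", "created_utc": 200, "body": "hello world"}',
]
conn = sqlite3.connect(":memory:")  # use a file path instead to keep the data
stored = load_comments(conn, sample)
hits = search(conn, "zst")
print(stored, hits)
```

A database file like this can be opened with any SQLite browser, which covers simple searching even without a web front end.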
-
Tell HN: My Reddit account was banned after adding my subs to the protest
The whole of Reddit (posts and comments separately) from 2005-06 until 2022-12 is on this [1] torrent link, and it's very easy to download, extract and use the data [2]. I'm writing my thesis about the connection between a Reddit post's type and its comment structure, and I've been working with this data for a few months; it's amazing.
[1] https://academictorrents.com/details/7c0645c94321311bb05bd87...
[2] https://github.com/Watchful1/PushshiftDumps
-
Reddit, API calls, and AI - Who does your knowledge belong to?
Sure! You can download the compressed data from this torrent, then you can use this project if you want to just decompress and process the data.
-
Script to find overlapping users between subreddits from dump files
You can go through the process outlined in that thread to download the subreddits you're interested in, then add them at the top of the new script, run it, and it will output the list of overlapping users. It will likely be faster than the old script, even counting download times for the dumps, since the API was so slow. You are limited to the available 20k subreddits, though.
-
This Reddit Community Has Been Archived
How do I read the file? First I tried to extract the file, and that worked, but then I got a text file that I can't read. I saw a few people saying it was just a JSON file, so I tried a JSON reader, but it says the JSON data is invalid. Then I tried this program, but nothing happens; no new file is created. Here is a screenshot. Maybe I'm doing something wrong, but I don't know, because the script doesn't have any instructions on how to use it!
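A likely reason a generic JSON reader rejects the extracted file is that it is newline-delimited JSON (NDJSON, as another post on this page calls it): every line is a separate JSON object, so the file as a whole is not one valid JSON document. A minimal sketch with made-up sample data:

```python
import json

# Two JSON objects, one per line, the way an NDJSON dump is laid out.
sample = '{"id": "a1", "body": "first"}\n{"id": "a2", "body": "second"}\n'

# Parsing the whole text at once fails because of the second object.
try:
    json.loads(sample)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# Parsing line by line works.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

print(whole_file_ok)
print([r["id"] for r in records])
```

For a real file, iterating over the open file object line by line avoids reading everything into memory.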
Sketchpad
-
I'm scared of losing this safe space (and other trans subreddits) in the face of API changes and possible hate flood that will come after
Also I'm gonna try and download all of the trans subreddits using this script https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py. Hopefully I can get it working tomorrow.
-
Reddit’s plan to kill third-party apps sparks widespread protests
Looks like there are also some unofficial, faster ways. But I don't know if they work: https://github.com/Watchful1/Sketchpad/blob/master/postDownloader.py
-
Script to find overlapping users between subreddits from dump files
A while back I wrote a fairly popular script that used the pushshift api to find overlapping users between subreddits. This doesn't work anymore since the api is down, so I threw together an updated script that does the same thing using the subreddit dump files.
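The core idea of finding overlapping users from dump data can be sketched as collecting a set of authors per subreddit and intersecting the sets. This is an illustration only, not the script mentioned above: the function names and the list of skipped accounts are hypothetical, and each record is assumed to carry `author` and `subreddit` fields.

```python
import json
from collections import defaultdict


def authors_by_subreddit(lines, skip=("[deleted]", "AutoModerator")):
    """Map each subreddit name to the set of authors seen in it."""
    authors = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        author = obj.get("author")
        subreddit = obj.get("subreddit")
        if author and subreddit and author not in skip:
            authors[subreddit].add(author)
    return authors


def overlapping_users(authors, subreddits):
    """Return the sorted list of users active in every listed subreddit."""
    sets = [authors.get(name, set()) for name in subreddits]
    return sorted(set.intersection(*sets)) if sets else []


sample = [
    '{"author": "alice", "subreddit": "python"}',
    '{"author": "bob", "subreddit": "python"}',
    '{"author": "alice", "subreddit": "learnpython"}',
    '{"author": "[deleted]", "subreddit": "learnpython"}',
    '{"author": "[deleted]", "subreddit": "python"}',
]
authors = authors_by_subreddit(sample)
overlap = overlapping_users(authors, ["python", "learnpython"])
print(overlap)
```

Keeping only sets of names means memory grows with the number of distinct users rather than with the number of comments.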
-
PRAW - getting ONLY top comments of a single specific thread efficiently
If you actually just want top level comments, I have an example here: https://github.com/Watchful1/Sketchpad/blob/master/load_top_level.py
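As a rough sketch of the general approach in PRAW (not necessarily what load_top_level.py does): iterating a submission's comment forest directly yields only the top level comments, and `replace_more(limit=0)` removes the "load more comments" placeholders first. The helper name is hypothetical, and the commented usage lines assume valid API credentials.

```python
def top_level_comment_bodies(submission):
    """Return the bodies of a submission's top level comments.

    `submission` is expected to be a praw.models.Submission.
    """
    # limit=0 drops MoreComments placeholders instead of fetching them,
    # so no extra requests are made for collapsed comments.
    submission.comments.replace_more(limit=0)
    # Iterating the forest itself (rather than .list()) skips nested replies.
    return [comment.body for comment in submission.comments]


# Usage sketch (requires the praw package and real credentials):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
# bodies = top_level_comment_bodies(reddit.submission(id="..."))
```

With `limit=0`, top level comments hidden behind a placeholder are omitted; passing a larger limit or `None` would fetch them at the cost of additional requests.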
- Late Night Random Discussion Thread - 04 April, 2023
-
Help with search and count results script of reddit API
I have a script here that lets you download a specific subreddit's or user's entire history using pushshift. It's a good example of how the URL works and how to iterate through results based on timestamp. You can add a q=keyword parameter to filter to only submissions/comments matching a specific keyword. And you could remove the subreddit parameter if you want data from all of Reddit.
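Iterating through results by timestamp can be sketched as repeatedly requesting items older than the oldest one seen so far. The example below is a hedged illustration, not the script referred to above: the endpoint URL and the `before`, `size` and `sort` parameter names are assumptions (only `q` and `subreddit` are named in the post), the response is assumed to hold items under a `data` key with a `created_utc` field, and other posts on this page report the API as down, so the fetch function is injectable.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint pattern; kind is "submission" or "comment".
BASE_URL = "https://api.pushshift.io/reddit/{kind}/search"


def build_url(kind, before, subreddit=None, q=None, size=100):
    """Build a search URL for items created before the given timestamp."""
    params = {"before": before, "size": size, "sort": "desc"}
    if subreddit:
        params["subreddit"] = subreddit  # omit to search all subreddits
    if q:
        params["q"] = q  # keyword filter
    return BASE_URL.format(kind=kind) + "?" + urlencode(params)


def fetch_json(url):
    """Default fetcher: download the URL and parse the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iterate_results(kind, start_before, subreddit=None, q=None, fetch=fetch_json):
    """Yield items newest to oldest, paging backwards by created_utc."""
    before = start_before
    while True:
        page = fetch(build_url(kind, before, subreddit, q)).get("data", [])
        if not page:
            break
        for item in page:
            yield item
        # The next request asks only for items older than this page's oldest.
        before = min(item["created_utc"] for item in page)
```

One limitation of this simple loop: items that share the exact timestamp of a page's oldest item but fall on the next page would be skipped, so a more careful version would deduplicate by id and page with an inclusive bound.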
-
Are the more comments objects directive, or random?
I have an old script I wrote a long time ago to fetch only the top level comments in a thread here, which isn't quite what you're trying to do but should be a good example.
- Getting more than 1000 threads.
-
Separate dump files for the top 20k subreddits
In addition to the dump files, pushshift offers an API with powerful filtering options. The main limitation is that it takes quite some time to download a substantial amount of data. If you have a use case that doesn't cleanly align to specific subreddits, take a look at my api download script here. Again I'm happy to work with you to build something for a specific use case.
-
Looking for advice on how to identify users based on unique combinations of subreddit activity
There was a script posted at https://github.com/Watchful1/Sketchpad/blob/master/overlapCounter.py. It does exactly what I need by using the pushshift API, but it seems too slow to work as a web app, and I also have no idea where to begin in converting the script to a web app.
What are some alternatives?
Pushshift-Importer
Pushshift API - Pushshift API
RedditLemmyImporter - 🔥 Anti-Reddit Aktion 🔥
qBittorrent - qBittorrent BitTorrent client
zreader - Read compressed NDJSON .zst files easily
7-Zip-zstd - 7-Zip with support for Brotli, Fast-LZMA2, Lizard, LZ4, LZ5 and Zstandard
reddit-project-public
Lemmy - 🐀 A link aggregator and forum for the fediverse
redarc - Reddit archiver
Nuitka - Nuitka is a Python compiler written in Python. It's fully compatible with Python 2.6, 2.7, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, and 3.11. You feed it your Python app, it does a lot of clever things, and spits out an executable or extension module.
RedditScrape - Quick and dirty script to suck down the pr0n from Reddit before it's too late