import json

from zstandard import ZstdDecompressor


def read_redditfile(file: str) -> dict:
    """
    Iterate over the pushshift JSON lines, yielding them as Python dicts.
    Decompress iteratively if necessary.
    """
    # older files in the dataset are uncompressed, while newer ones use
    # zstd compression and have .xz, .bz2, or .zst endings
    if not file.endswith('.bz2') and not file.endswith('.xz') and not file.endswith('.zst'):
        with open(file, 'r', encoding='utf-8') as infile:
            for line in infile:
                yield json.loads(line)
    else:
        # code by Watchful1, written for the Pushshift offline dataset:
        # https://github.com/Watchful1/PushshiftDumps
        with open(file, 'rb') as fh:
            dctx = ZstdDecompressor(max_window_size=2147483648)
            with dctx.stream_reader(fh) as reader:
                previous_line = ""
                while True:
                    chunk = reader.read(2**24)  # 16 MB chunks
                    if not chunk:
                        break
                    string_data = chunk.decode('utf-8')
                    lines = string_data.split("\n")
                    for i, line in enumerate(lines[:-1]):
                        if i == 0:
                            # prepend the partial line carried over from the previous chunk
                            line = previous_line + line
                        comment = json.loads(line)
                        yield comment
                    previous_line = lines[-1]
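For context, the function is a generator and can be consumed like any other iterable. A minimal usage sketch, where the dump file name is just a placeholder and 'author'/'body' are fields from the Pushshift comment schema:

# hypothetical dump file name; any uncompressed or .zst dump works the same way
for comment in read_redditfile('RC_2023-01.zst'):
    print(comment.get('author'), comment.get('body', '')[:80])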
I had the same problem using the Pushshift Zreader class; it was probably caused by a multibyte UTF-8 sequence being split across two chunks of decompressed data.
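If that is the cause, one way to make the chunked decode robust is an incremental UTF-8 decoder, which buffers an incomplete trailing multibyte sequence until the next chunk arrives instead of raising UnicodeDecodeError. A minimal sketch, not the original author's fix; iter_json_lines is a hypothetical helper and reader is the zstd stream reader from the code above:

import codecs
import json


def iter_json_lines(reader):
    """Yield JSON objects from a decompressed stream, tolerating chunk
    boundaries that split a multibyte UTF-8 sequence."""
    decoder = codecs.getincrementaldecoder('utf-8')()
    previous_line = ""
    while True:
        chunk = reader.read(2**24)  # 16 MB of decompressed data per read
        if not chunk:
            break
        # an incomplete trailing multibyte sequence stays buffered inside
        # the decoder until the next chunk completes it
        string_data = decoder.decode(chunk)
        lines = string_data.split("\n")
        for i, line in enumerate(lines[:-1]):
            if i == 0:
                line = previous_line + line
            yield json.loads(line)
        previous_line = lines[-1]

Another option that sidesteps manual chunking is wrapping the stream reader in io.TextIOWrapper(reader, encoding='utf-8') and iterating over it line by line, since the wrapper handles multibyte boundaries itself.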
Related posts
- 📣 I want to debunk Reddit's claims, and talk about their unwillingness to work with developers, moderators, and the larger community, as well as say thank you for all the support
- zstd
- Reddit Will License Its Data to Train LLMs, We Made a FF Extension to Replace
- Cobalt Tools Media Downloader
- Chrome Feature: ZSTD Content-Encoding