Analyzing multi-gigabyte JSON files locally

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. xsv

    (Discontinued) A fast CSV command-line toolkit written in Rust.

    If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).

    https://github.com/BurntSushi/xsv

    https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql...
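
    If the JSON is newline-delimited records, flattening it to CSV first is the usual step before reaching for those tools. A minimal Python sketch (file and field names are hypothetical, and nested values would need extra handling):

    ```python
    import csv
    import json

    # Flatten newline-delimited JSON records into CSV so xsv/zsv or
    # sqlite3's .import can operate on them.
    with open("records.ndjson") as src, open("records.csv", "w", newline="") as dst:
        writer = None
        for line in src:
            rec = json.loads(line)
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=sorted(rec),
                                        extrasaction="ignore", restval="")
                writer.writeheader()
            writer.writerow(rec)

    # Afterwards, e.g.: `xsv index records.csv && xsv stats records.csv`,
    # or import the CSV into sqlite3 and add indexes for repeated queries.
    ```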

  2. ClickHouse

    ClickHouse® is a real-time analytics database management system

    It is not necessary to write code in Rust with simdjson. You can insert the dataset into ClickHouse as described here: https://github.com/ClickHouse/ClickHouse/issues/22482

    Running a full scan over the body column in ClickHouse took 140 seconds on a single c5ad.24xlarge machine; the bottleneck is disk read (only 24 of the CPU cores are used).
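
    For illustration only (this is not the exact schema from that issue; table and column names are hypothetical), loading and scanning from Python with the clickhouse-connect client might look like:

    ```python
    import clickhouse_connect  # pip install clickhouse-connect

    client = clickhouse_connect.get_client(host="localhost")
    client.command("""
        CREATE TABLE IF NOT EXISTS comments (
            id   String,
            body String
        ) ENGINE = MergeTree ORDER BY id
    """)
    # Bulk ingestion of the JSON dump is easiest server-side, e.g. with
    # clickhouse-client: INSERT INTO comments FROM INFILE '...' FORMAT JSONEachRow
    result = client.query(
        "SELECT count() FROM comments WHERE body ILIKE '%simdjson%'"
    )
    print(result.result_rows)  # a full scan, like the one timed above
    ```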

  3. octosql

    OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
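
    OctoSQL is a CLI that treats files as tables, so a query is a one-liner; here it is driven from Python to match the other sketches (the file and column names are hypothetical):

    ```python
    import subprocess

    # OctoSQL infers the data source from the file extension and runs
    # plain SQL over it, JOINs across formats included.
    query = ("SELECT subreddit, COUNT(*) AS n "
             "FROM ./submissions.parquet "
             "GROUP BY subreddit ORDER BY n DESC LIMIT 10")
    subprocess.run(["octosql", query], check=True)
    ```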

  4. json-buffet

    And here's the code: https://github.com/multiversal-ventures/json-buffet

    The API isn't the best: I'd have preferred an iterator-based solution to this callback-based one, but we worked with what rapidjson gave us for the proof of concept.
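
    The distinction matters for ergonomics: a callback API pushes events at you, while an iterator lets the caller pull. When a parser can be fed incrementally, the standard adapter from one style to the other looks roughly like this (the parser interface below is hypothetical, not json-buffet's actual API):

    ```python
    def iter_events(chunks, make_parser):
        """Adapt a callback-style streaming parser into a pull-style iterator.

        make_parser(on_event) is assumed to return a parser whose
        consume(chunk) fires on_event synchronously for each parse event.
        """
        events = []
        parser = make_parser(lambda *event: events.append(event))
        for chunk in chunks:
            parser.consume(chunk)  # callbacks run here, filling `events`
            yield from events
            events.clear()
    ```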

  5. pysimdjson

    Python bindings for the simdjson project.
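
    A minimal sketch of typical use (the file and key names are hypothetical):

    ```python
    import simdjson  # pip install pysimdjson

    parser = simdjson.Parser()
    doc = parser.load("big.json")  # SIMD-accelerated parse of the whole file
    # Access is lazy: values are materialized as Python objects only when
    # touched, which keeps traversal of huge documents cheap.
    print(len(doc["items"]))
    ```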

  6. reddit_mining

    zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE or LZ4 in all the benchmarks that I've seen.

    You might be interested in converting the pushshift data to Parquet. Using octosql I'm able to query the submissions data (from the beginning of Reddit to Sept 2022) in about 10 minutes.

    https://github.com/chapmanjacobd/reddit_mining#how-was-this-...

    Although if you're sending the data to Postgres or BigQuery, you can probably get better query performance via indexes or parallelism.
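
    A hedged sketch of such a conversion (not the repo's actual script; file and column names are hypothetical, and pushshift dumps need zstd's long-window mode):

    ```python
    import io
    import json

    import pyarrow as pa
    import pyarrow.parquet as pq
    import zstandard  # pip install zstandard pyarrow

    def flush(writer, batch):
        table = pa.Table.from_pylist(batch)
        if writer is None:
            writer = pq.ParquetWriter("submissions.parquet", table.schema)
        writer.write_table(table)
        return writer

    writer, batch = None, []
    with open("RS_2022-09.zst", "rb") as raw:
        # Pushshift archives use a long zstd window, hence max_window_size.
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            rec = json.loads(line)
            batch.append({"id": rec.get("id"),
                          "subreddit": rec.get("subreddit"),
                          "title": rec.get("title")})
            if len(batch) == 100_000:  # write in chunks to bound memory
                writer = flush(writer, batch)
                batch = []
    if batch:
        writer = flush(writer, batch)
    if writer is not None:
        writer.close()
    ```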

  7. json-streamer

    A fast streaming JSON parser for Python that generates SAX-like events using yajl

    Might be useful for some - https://github.com/kashifrazzaqui/json-streamer
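
    A short sketch of its event-driven use, based on the project's README (method names as I recall them; verify against the repo):

    ```python
    from jsonstreamer import JSONStreamer  # pip install jsonstreamer

    def on_event(event_name, *args):
        # Events arrive SAX-style: doc_start, object_start, key, value, ...
        print(event_name, args)

    streamer = JSONStreamer()
    streamer.add_catch_all_listener(on_event)
    streamer.consume('{"id": 1, "tags": ')  # feed partial chunks as they arrive
    streamer.consume('["a", "b"]}')
    streamer.close()
    ```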

  8. json-toolkit

    "the best opensource converter I've found across the Internet" -- dene14

    > Also note that this approach generalizes to other text-based formats. If you have 10 gigabytes of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.

    You can also generalize it without learning a new minilanguage by using https://github.com/tyleradams/json-toolkit which converts csv/binary/whatever to/from json
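
    The core of that conversion is simple enough to sketch in plain Python (CSV-to-JSON-lines direction only; all values stay strings unless you add type coercion):

    ```python
    import csv
    import json
    import sys

    # Emit each CSV row as one JSON object per line, so the usual JSON
    # tooling (jq and friends) applies to what started as tabular data.
    with open("data.csv", newline="") as f:
        for row in csv.DictReader(f):
            sys.stdout.write(json.dumps(row) + "\n")
    ```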

  9. zsv

    zsv+lib: tabular data swiss-army knife CLI + world's fastest (simd) CSV parser

    (The same comment quoted under xsv above; the commenter is zsv's author.)

  10. semi_index

    Implementation of the JSON semi-index described in the paper "Semi-Indexing Semi-Structured Data in Tiny Space"

  11. jq-zsh-plugin

    jq zsh plugin

    https://github.com/reegnz/jq-zsh-plugin

    I find that for big datasets, choosing the right format is crucial. I use the json-lines format plus some shell filtering (e.g. head and tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq-repl of mine to iterate fast on the final jq expression.

    I found that the REPL form factor works really well when you don't exactly know what you're digging for.
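
    The reduce-first step generalizes: a cheap textual pre-filter shrinks the working set before any real JSON parsing happens. A sketch (file name and pattern are hypothetical):

    ```python
    import re

    # Stand-in for the head/tail/egrep step: keep only candidate rows of a
    # json-lines file, then explore the small subset interactively with jq.
    pattern = re.compile(r"clickhouse", re.IGNORECASE)
    with open("big.jsonl") as src, open("subset.jsonl", "w") as dst:
        for line in src:
            if pattern.search(line):
                dst.write(line)
    ```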

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • csvkit: Command-line tools for working with CSV

    1 project | news.ycombinator.com | 20 Jan 2023
  • Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet

    9 projects | news.ycombinator.com | 24 Sep 2022
  • What is the most user friendly way to upload CSV files into a SQL database?

    1 project | /r/SQL | 14 Jul 2022
  • textql VS trdsql - a user suggested alternative

    2 projects | 25 Jun 2022
  • One-liner for running queries against CSV files with SQLite

    1 project | /r/programming | 22 Jun 2022
