Analyzing multi-gigabyte JSON files locally

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • xsv

    A fast CSV command line toolkit written in Rust.

    If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).

    https://github.com/BurntSushi/xsv

    https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql...
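
    A minimal sketch of that JSON-to-sqlite3 route, assuming one JSON object per line with "id" and "body" fields (both placeholders; adjust the schema to your data):

    ```python
    # Stream JSON lines into SQLite, then index so later queries don't scan.
    import json
    import sqlite3

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, body TEXT)")

    with open("data.jsonl", encoding="utf-8") as f:
        rows = ((r.get("id"), r.get("body")) for r in map(json.loads, f))
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_id ON records (id)")
    conn.commit()
    conn.close()
    ```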

  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

    It is not necessary to write code in Rust with simdjson. You can insert the dataset into ClickHouse as described here: https://github.com/ClickHouse/ClickHouse/issues/22482

    A full scan over the body field in ClickHouse took 140 seconds on a single c5ad.24xlarge machine; the bottleneck is disk read (only 24 CPU cores are used).
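
    A hedged sketch of the same idea without standing up a server, shelling out to clickhouse-local from Python (the binary may be invoked as `clickhouse local` on newer installs; the file name and `body` column are placeholders):

    ```python
    # Full scan over newline-delimited JSON via ClickHouse's file() table function.
    import subprocess

    query = (
        "SELECT count() "
        "FROM file('comments.jsonl', JSONEachRow) "
        "WHERE body LIKE '%clickhouse%'"
    )
    subprocess.run(["clickhouse-local", "--query", query], check=True)
    ```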

  • octosql

    OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
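
    As a sketch of what that looks like in practice (assuming octosql is on PATH and resolves table names in the query to local files, as its README shows; the file and column names are placeholders):

    ```python
    # One subprocess call: SQL aggregation directly over a JSON file.
    import subprocess

    sql = "SELECT author, COUNT(*) AS n FROM comments.json GROUP BY author"
    subprocess.run(["octosql", sql], check=True)
    ```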

  • json-buffet

    And here's the code: https://github.com/multiversal-ventures/json-buffet

    The API isn't the best. I'd have preferred an iterator-based solution as opposed to this callback-based one, but we worked with what rapidjson gave us for the proof of concept.
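
    The tradeoff being described, sketched with hypothetical names (none of these are json-buffet's real entry points):

    ```python
    # Callback style: the parser owns the loop and pushes values at you.
    def on_match(json_pointer, value):      # hypothetical signature
        print(json_pointer, value)

    # parse_stream("big.json", on_match)    # hypothetical entry point

    # Iterator style: the caller owns the loop, so early exit composes naturally.
    # for pointer, value in stream_items("big.json"):   # hypothetical
    #     if interesting(value):
    #         break
    ```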

  • pysimdjson

    Python bindings for the simdjson project.
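
    A minimal usage sketch (`pip install pysimdjson`; the document here is invented for illustration):

    ```python
    # The Parser reuses internal buffers across calls, and parsed documents
    # are lazy proxies over the underlying bytes, not full Python dicts.
    import simdjson

    parser = simdjson.Parser()
    doc = parser.parse(b'{"user": {"name": "ada"}, "score": 42}')
    print(doc["user"]["name"], doc["score"])  # values decoded on access
    ```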

  • reddit_mining

    zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE or LZ4 in all the benchmarks that I've seen.

    You might be interested in converting the pushshift data to Parquet. Using octosql I'm able to query the submissions data (from the beginning of Reddit to Sept 2022) in about 10 minutes.

    https://github.com/chapmanjacobd/reddit_mining#how-was-this-...

    Although if you're sending the data to Postgres or BigQuery, you can probably get better query performance via indexes or parallelism.
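
    A sketch of that streaming-decompression pattern with the zstandard package (the file name and field names are placeholders; pushshift dumps need a large decompression window):

    ```python
    # Scan a pushshift .zst dump line by line without materializing it on disk.
    import io
    import json
    import zstandard

    with open("RS_2022-09.zst", "rb") as fh:
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            post = json.loads(line)
            if post.get("subreddit") == "programming":
                print(post.get("title"))
    ```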

  • json-streamer

    A fast streaming JSON parser for Python that generates SAX-like events using yajl
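
    A sketch of that event-driven style, with method names taken from my reading of the project's README (treat them as assumptions):

    ```python
    # Feed the parser in chunks and react to SAX-like events; the full
    # document is never held in memory.
    from jsonstreamer import JSONStreamer

    def catch_all(event_name, *args):
        print(event_name, args)   # e.g. object_start, key, value, ...

    streamer = JSONStreamer()
    streamer.add_catch_all_listener(catch_all)
    streamer.consume('{"fruits": ["apple", "ban')   # arbitrary chunk boundaries
    streamer.consume('ana"], "calories": 100}')
    streamer.close()
    ```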

  • json-toolkit

    "the best opensource converter I've found across the Internet" -- dene14

    > Also note that this approach generalizes to other text-based formats. If you have 10 gigabytes of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.

    You can also generalize it without learning a new mini-language by using https://github.com/tyleradams/json-toolkit, which converts CSV/binary/whatever to/from JSON.
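
    For the Miller route in the quote above, a hedged sketch (assumes mlr 6+ on PATH; the file and field names are placeholders):

    ```python
    # Stream CSV through Miller as JSON lines; filter in Python without
    # ever materializing the full conversion.
    import json
    import subprocess

    proc = subprocess.Popen(
        ["mlr", "--icsv", "--ojsonl", "cat", "big.csv"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        row = json.loads(line)
        if row.get("status") == "error":
            print(row)
    proc.wait()
    ```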

  • zsv

    zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more

  • semi_index

    Implementation of the JSON semi-index described in the paper "Semi-Indexing Semi-Structured Data in Tiny Space"

  • jq-zsh-plugin

    jq zsh plugin

    https://github.com/reegnz/jq-zsh-plugin

    I find that for big datasets choosing the right format is crucial: use the json-lines format plus some shell filtering (e.g. head and tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq REPL of mine to iterate quickly on the final jq expression.

    I found that the REPL form factor works really well when you don't exactly know what you're digging for.
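
    The reduce-then-iterate workflow reads roughly like this as a sketch (the path, match string, and cap are placeholders):

    ```python
    # Cheap textual prefilters shrink a huge JSON-lines file to a small
    # working set before any real parsing or jq iteration happens.
    import json

    sample = []
    with open("events.jsonl", encoding="utf-8") as f:
        for line in f:
            if '"error"' not in line:     # ripgrep-style substring prefilter
                continue
            sample.append(json.loads(line))
            if len(sample) >= 10_000:     # head-style cap
                break

    # now iterate on the final jq-style expression against `sample`
    ```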
