The fastest tool for querying large JSON files is written in Python (benchmark)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • simdjson

    Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, WatermelonDB, Apache Doris, Milvus, StarRocks

    Daniel Lemire’s simdjson probably belongs in this discussion and I would be surprised if it is not the fastest tool by some margin.

    https://github.com/simdjson/simdjson

  • ojg

    Optimized JSON for Go

    For me OjG (https://github.com/ohler55/ojg) has been great. I regularly use it on files that cannot be loaded into memory. The best JSON file format for multiple records is one JSON document per record, all in the same file; OjG doesn't care whether or not they are on separate lines. It is fast (https://github.com/ohler55/compare-go-json) and has a fairly complete JSONPath implementation for searches. It is similar to jq, but uses JSONPath instead of a proprietary query language.

    I am biased though as I wrote OjG to handle what other tools were not able to do.
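
    A minimal Python sketch of the record-per-document layout described above (the file name and "score" field are hypothetical): because each line is a complete JSON document, the file can be processed one record at a time with bounded memory, which is the property the comment relies on.

    ```python
    import json

    # Stream a newline-delimited JSON file one record at a time, so memory
    # use stays bounded regardless of file size. The file name and "score"
    # field are hypothetical.
    total = 0.0
    with open("records.json") as f:
        for line in f:
            record = json.loads(line)
            total += record["score"]
    print(total)
    ```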

  • compare-go-json

    A comparison of several Go JSON packages.

  • ultrajson

    Ultra fast JSON decoder and encoder written in C with Python bindings

    I asked about this on the GitHub issue regarding these benchmarks as well.

    I'm curious as to why libraries like ultrajson[0] and orjson[1] weren't explored. They aren't command-line tools, but neither is pandas, right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., `python -c "..."`)?

    [0] https://github.com/ultrajson/ultrajson
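
    For illustration, here is a hypothetical sketch of how a library like orjson could fill the same role pandas did in the benchmark, driven from the command line; the file name, field, and query are made up, not the benchmark's actual code.

    ```python
    import sys

    import orjson

    # Hypothetical stand-in for a `python -c "..."`-style invocation: sum a
    # field across newline-delimited JSON records piped in on stdin, e.g.
    #   python sum_field.py < records.json
    total = sum(orjson.loads(line)["score"] for line in sys.stdin.buffer)
    print(total)
    ```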

  • orjson

    Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy

  • bert

    TensorFlow code and pre-trained models for BERT

    > resulting in large programs with lots of boilerplate

    That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

    Thank you for this benchmark! I'll probably switch from jq to spyql now.

    > So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

    Yes, I definitely think this is worth mentioning upfront in the future, since, IIUC, orjson's core uses Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python JSON parsing-and-querying solution was the fastest out there, which I find misleading.

    A parallel I think is helpful is saying something like "the fastest BERT implementation is written in Python[0]". While the linked implementation is written in Python, it offloads the performance-critical parts to C/C++ through TensorFlow.

    [0] https://github.com/google-research/bert

  • catj

    Displays JSON files in a flat format.

    My main problem with jq was finding the paths to the nodes, so I wrote catj (https://github.com/soheilpro/catj) to help with that.
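
    A rough Python sketch of the idea behind catj (illustrative only, not catj's actual implementation or output format): print every leaf value together with the full path that reaches it, so paths can be found with grep and pasted into a query.

    ```python
    # Recursively walk a JSON value and print one "path = value" line per
    # leaf, which makes node paths easy to find with grep.
    def flatten(value, path="data"):
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, f"{path}.{key}")
        elif isinstance(value, list):
            for i, child in enumerate(value):
                flatten(child, f"{path}[{i}]")
        else:
            print(f"{path} = {value!r}")

    flatten({"store": {"book": [{"title": "Moby-Dick"}]}})
    # prints: data.store.book[0].title = 'Moby-Dick'
    ```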

  • db-benchmark

    reproducible benchmark of database-like ops

  • pysimdjson

    Python bindings for the simdjson project.

    > json: 113.79130696877837 ms

    While `orjson` is faster than `ujson`/`json` here, it's only ~6% faster (in this benchmark). `simdjson` and `msgspec` (my library, see https://jcristharif.com/msgspec/) are much faster because they avoid creating PyObjects for fields that are never used.

    If spyql's query engine can statically determine which fields it will access before processing, you might find that using `msgspec` for JSON gives a nice speedup (it will also type-check the JSON if you know the type of each field). If that information isn't known, you may find that using `pysimdjson` (https://pysimdjson.tkte.ch/) gives an easy speed boost, as it should be more of a drop-in replacement for `orjson`.
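
    A minimal sketch of the typed-decoding approach described above (the `Record` struct and its fields are hypothetical): msgspec materializes only the declared fields, type-checks them while parsing, and skips unknown keys.

    ```python
    import msgspec

    # Hypothetical record type: only `id` and `name` are decoded and
    # type-checked; the "unused" key is skipped during parsing.
    class Record(msgspec.Struct):
        id: int
        name: str

    decoder = msgspec.json.Decoder(Record)
    rec = decoder.decode(b'{"id": 1, "name": "a", "unused": [0, 1, 2]}')
    print(rec.id, rec.name)  # 1 a
    ```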

  • msgspec

    A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

  • datasette

    An open source multi-tool for exploring and publishing data

    "Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:

    https://github.com/simonw/datasette

    From the same author, "sqlite-utils" can take tabular data and create SQLite table definitions and rows from them:

    https://github.com/simonw/sqlite-utils

    "Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"

      > * What sort of JSON "meta-formats" are the most important/common for you? E.g., in a file you could have object-per-line, object-of-arrays, or array-of-objects; in an SQL context you could have object-per-row or object-of-arrays-as-table, etc. I'd love to hear about others that are important to you.
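
    As an illustration of the sqlite-utils workflow quoted above, its Python API can do the same thing as the CLI pipe: insert JSON records into a new table whose schema is inferred from the data (the database, table, and records here are hypothetical).

    ```python
    import sqlite_utils

    # Create a database file and a "records" table whose columns are
    # inferred from the inserted dicts. File and table names are made up.
    db = sqlite_utils.Database("data.db")
    db["records"].insert_all(
        [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
        pk="id",
    )
    for row in db.query("select name from records where id = 2"):
        print(row)  # {'name': 'b'}
    ```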

  • sqlite-utils

    Python CLI utility and library for manipulating SQLite databases

    "Datasette" (from Django co-creator) can take tabular data (SQLite, CSV, JSON, etc) and generate a REST/GraphQL API with visualization tools from it:

    https://github.com/simonw/datasette

    From the same author, "sqlite-utils" can take tabular data and create SQLite table definitions and rows from them:

    https://github.com/simonw/sqlite-utils

    "Pipe JSON (or CSV or TSV) directly into a new SQLite database file, automatically creating a table with the appropriate schema"

      > * What sort of JSON "meta-formats" are the most important/common for you? E.g. in a file you could have object-per-line, object-of-arrays, array-of-objects, or in an SQL context you could have object-per-row or object-of-arrays-as-table, etc). I'd love to hear about others that are important to you.

  • ojc

    Optimized JSON in C
