-
If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).
https://github.com/BurntSushi/xsv
https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql...
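A minimal sketch of the sqlite3 suggestion, assuming the data is available as JSON Lines and that each record has `author`, `subreddit`, and `body` fields (these field names, the table name, and the index name are hypothetical, chosen only for illustration). It streams records into a table and adds an index so that later lookups on `subreddit` do not need a full scan.

```python
import json
import sqlite3


def load_records(conn, lines):
    """Stream JSON Lines records into SQLite and index one lookup column."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments (author TEXT, subreddit TEXT, body TEXT)"
    )

    def rows():
        # Generator, so the whole dataset is never held in memory at once.
        for line in lines:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            yield (rec.get("author"), rec.get("subreddit"), rec.get("body"))

    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows())
    # Hypothetical index name; the index is what makes repeated lookups cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_comments_subreddit ON comments (subreddit)"
    )
    conn.commit()


def count_by_subreddit(conn, subreddit):
    """Count rows for one subreddit; SQLite can answer this from the index."""
    cur = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE subreddit = ?", (subreddit,)
    )
    return cur.fetchone()[0]


# Tiny made-up sample standing in for a real JSON Lines file.
sample = [
    '{"author": "a", "subreddit": "python", "body": "hello"}',
    '{"author": "b", "subreddit": "rust", "body": "hi"}',
    '{"author": "c", "subreddit": "python", "body": "hey"}',
]
conn = sqlite3.connect(":memory:")
load_records(conn, sample)
```

For a real dataset, `lines` would be an open file object and the connection would point at a file on disk rather than `:memory:`.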
-
You don't need to write code with Rust and simdjson. You can insert the dataset into ClickHouse as described here: https://github.com/ClickHouse/ClickHouse/issues/22482
A full scan searching for "ClickHouse" in the body took 140 seconds on a single c5ad.24xlarge machine; the bottleneck is disk read (only 24 CPU cores are used).
-
octosql
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
-
And here's the code: https://github.com/multiversal-ventures/json-buffet
The API isn't the best. I'd have preferred an iterator-based solution as opposed to this callback-based one, but we worked with what rapidjson gave us for the proof of concept.
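To illustrate the difference in API shape, here is a sketch in Python. It is not json-buffet's or rapidjson's actual API: `parse_with_callback` is a hypothetical stand-in for any push-style parser, and `iter_records` shows one common way to adapt it into a pull-style iterator using a worker thread and a bounded queue.

```python
import json
import queue
import threading


def parse_with_callback(lines, on_record):
    """Hypothetical callback-style (push) parser: calls on_record per record."""
    for line in lines:
        on_record(json.loads(line))


def iter_records(lines, maxsize=16):
    """Adapt the push-style parser into a pull-style iterator.

    The parser runs in a worker thread and pushes into a bounded queue,
    so it can never get more than `maxsize` records ahead of the consumer.
    """
    q = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream

    def produce():
        try:
            parse_with_callback(lines, q.put)
        finally:
            # Always signal completion, even if parsing raised.
            q.put(done)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item
```

With the iterator form the caller can write `for rec in iter_records(f): ...` and stop early. This sketch does not propagate parser errors to the consumer; a fuller version would pass exceptions through the queue.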
-
zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE or LZ4 in all the benchmarks that I've seen.
You might be interested in converting the Pushshift data to Parquet. Using octosql, I'm able to query the submissions data (from the beginning of Reddit to Sept 2022) in about 10 minutes.
https://github.com/chapmanjacobd/reddit_mining#how-was-this-...
Although if you're sending the data to postgres or BigQuery you can probably get better query performance via indexes or parallelism.
-
Might be useful for some - https://github.com/kashifrazzaqui/json-streamer
-
> Also note that this approach generalizes to other text-based formats. If you have 10 gigabyte of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.
You can also generalize it without learning a new minilanguage by using https://github.com/tyleradams/json-toolkit which converts csv/binary/whatever to/from json
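The general idea of converting another format to JSON first can be sketched with the Python standard library alone. This is not how json-toolkit is implemented; `csv_to_jsonl` is a hypothetical helper that turns each CSV row into one JSON object per line, so the result can be fed to the same JSON tooling.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Yield one JSON object (as a string) per CSV row, keyed by the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # All values stay strings; CSV carries no type information.
        yield json.dumps(row)


# Made-up sample input.
sample_csv = "id,name\n1,alpha\n2,beta\n"
jsonl = list(csv_to_jsonl(sample_csv))
```

For a large file you would pass an open file to `csv.DictReader` directly instead of wrapping a string in `io.StringIO`.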
-
zsv
zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more
-
semi_index
Implementation of the JSON semi-index described in the paper "Semi-Indexing Semi-Structured Data in Tiny Space"
-
https://github.com/reegnz/jq-zsh-plugin
I find that for big datasets choosing the right format is crucial. I use the JSON Lines format plus some shell filtering (e.g. head and tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq-repl of mine to iterate quickly on the final jq expression.
I found that the REPL form factor works really well when you don't exactly know what you're digging for.
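The same reduce-then-explore workflow can be sketched in Python, assuming JSON Lines input. `filter_jsonl` is a hypothetical helper: it limits the line range (the head/tail step), applies a cheap substring test on the raw text (the grep step), and only parses the lines that survive.

```python
import json
from itertools import islice


def filter_jsonl(lines, needle, start=0, stop=None):
    """Yield parsed records from lines[start:stop] whose raw text contains needle."""
    for line in islice(lines, start, stop):
        # Cheap textual pre-filter: skip JSON parsing for lines that cannot match.
        if needle not in line:
            continue
        yield json.loads(line)


# Made-up sample standing in for a large JSON Lines file.
sample_lines = [
    '{"id": 1, "lang": "rust"}',
    '{"id": 2, "lang": "go"}',
    '{"id": 3, "lang": "rust"}',
    '{"id": 4, "lang": "python"}',
]
matches = list(filter_jsonl(sample_lines, "rust", start=1))
```

The substring test works on the raw line, so it can match keys as well as values and can miss strings that are stored with escape sequences; it is a coarse first pass before the precise query.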