-
If it could be tabular in nature, maybe convert it to sqlite3 so you can make use of indexing, or to CSV to make use of high-performance tools like xsv or zsv (I'm an author of the latter).
https://github.com/BurntSushi/xsv
https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql...
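A minimal sketch of the sqlite3 route in Python, assuming a newline-delimited JSON file and hypothetical `id`/`author`/`body` fields:

```python
import json
import sqlite3

# Load newline-delimited JSON into a SQLite table so later queries can use an
# index instead of re-scanning the raw file. Field names are assumptions.
conn = sqlite3.connect("comments.db")
conn.execute("CREATE TABLE IF NOT EXISTS comments (id TEXT, author TEXT, body TEXT)")

with open("comments.ndjson", encoding="utf-8") as f:
    records = (json.loads(line) for line in f)
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?, ?)",
        ((r.get("id"), r.get("author"), r.get("body")) for r in records),
    )

# The index is what turns repeated lookups from full scans into cheap seeks.
conn.execute("CREATE INDEX IF NOT EXISTS idx_comments_author ON comments(author)")
conn.commit()

print(conn.execute("SELECT count(*) FROM comments WHERE author = ?", ("someuser",)).fetchone()[0])
```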
-
It is not necessary to write code in Rust with simdjson. You can insert the dataset into ClickHouse as described here: https://github.com/ClickHouse/ClickHouse/issues/22482
A full scan over the body column in ClickHouse took 140 seconds on a single c5ad.24xlarge machine; the bottleneck is disk read (only 24 CPU cores are used).
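For reference, a query of that shape can also be issued from Python with the clickhouse-driver package; the table/column names and the substring filter below are assumptions, not the schema from the linked issue:

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

# Host and credentials are assumptions for a locally running ClickHouse server.
client = Client(host="localhost")

# A full-scan style substring search over a text column, similar in spirit to
# the 140-second scan described above.
rows = client.execute(
    "SELECT count() FROM comments WHERE positionCaseInsensitive(body, %(needle)s) > 0",
    {"needle": "clickhouse"},
)
print(rows[0][0])
```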
-
octosql
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
-
And here's the code: https://github.com/multiversal-ventures/json-buffet
The API isn't the best. I'd have preferred an iterator-based solution as opposed to this callback-based one, but we worked with what rapidjson gave us for the proof of concept.
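To illustrate the difference between the two styles (this is not json-buffet's actual API), a small Python sketch: the iterator side uses ijson, the callback side is a hypothetical SAX-like handler in the spirit of rapidjson:

```python
import ijson  # streaming JSON parser; pip install ijson

# Iterator style: the caller pulls parse events, keeps state in ordinary local
# variables, and can stop early with a plain break.
with open("big.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "string" and prefix.split(".")[-1] == "body":
            print(value)
            break

# Callback style (hypothetical handler, imitating a SAX-like API such as
# rapidjson's): state has to live on the handler object, and early exit is
# awkward because the parser drives the loop.
class BodyCollector:
    def __init__(self):
        self.bodies = []

    def on_string(self, key, value):
        if key == "body":
            self.bodies.append(value)
```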
-
zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE or LZ4 in all the benchmarks that I've seen.
You might be interested in converting the pushshift data to Parquet. Using octosql I'm able to query the submissions data (from the beginning of reddit to Sept 2022) in about 10 minutes:
https://github.com/chapmanjacobd/reddit_mining#how-was-this-...
Although if you're sending the data to postgres or BigQuery you can probably get better query performance via indexes or parallelism.
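As a rough illustration of why Parquet helps here (column pruning and predicate pushdown), a hedged sketch using pyarrow rather than octosql; file and column names are assumptions:

```python
import pyarrow.dataset as ds  # pip install pyarrow

# Parquet is columnar: only the referenced columns are read, and row groups
# whose statistics can't match the filter are skipped entirely.
dataset = ds.dataset("reddit_submissions.parquet", format="parquet")
table = dataset.to_table(
    columns=["subreddit", "title", "score"],
    filter=(ds.field("subreddit") == "DataHoarder") & (ds.field("score") > 100),
)
print(table.num_rows)
```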
-
Might be useful for some - https://github.com/kashifrazzaqui/json-streamer
-
> Also note that this approach generalizes to other text-based formats. If you have 10 gigabyte of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.
You can also generalize it without learning a new minilanguage by using https://github.com/tyleradams/json-toolkit which converts csv/binary/whatever to/from json
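The conversion itself is small; a stdlib-only sketch (not json-toolkit's implementation) of turning CSV into JSON lines so generic JSON tooling can take over:

```python
import csv
import json
import sys

# Emit one JSON object per CSV row so downstream JSON tools (jq and friends)
# can be reused instead of a format-specific minilanguage.
for row in csv.DictReader(sys.stdin):
    sys.stdout.write(json.dumps(row) + "\n")
```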
-
zsv
zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more
-
semi_index
Implementation of the JSON semi-index described in the paper "Semi-Indexing Semi-Structured Data in Tiny Space"
-
https://github.com/reegnz/jq-zsh-plugin
I find that for big datasets choosing the right format is crucial. I use the json-lines format plus some shell filtering (e.g. head and tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq REPL of mine to iterate quickly on the final jq expression.
I found that the REPL form factor works really well when you don't exactly know what you're digging for.
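The same reduce-then-iterate idea sketched in Python, for anyone not working in the shell; the file name, pattern, and sample size are assumptions:

```python
import itertools
import re

# Cut a huge JSON-lines file down to a small sample that is quick to iterate
# on interactively, mirroring the head/ripgrep reduction step described above.
pattern = re.compile(r"DataHoarder")
with open("submissions.jsonl", encoding="utf-8") as src, \
     open("sample.jsonl", "w", encoding="utf-8") as dst:
    matching = (line for line in src if pattern.search(line))
    dst.writelines(itertools.islice(matching, 10_000))
```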
Related posts
- csvkit: Command-line tools for working with CSV
- Run SQL on CSV, Parquet, JSON, Arrow, Unix Pipes and Google Sheet
- What is the most user friendly way to upload CSV files into a SQL database?
- One-liner for running queries against CSV files with SQLite