SPyQL is really cool and its design is very smart, with it being able to leverage normal Python functions!
As far as similar tools go, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].
DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.
dsq is based on SQLite, which means it has to load the data into SQLite first, but then gives you the whole breadth of SQLite. It also supports many data formats, but it is slower.
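To illustrate the load-then-query approach described for dsq, here is a minimal Python sketch using only the standard library. It is not dsq's actual code; the helper name `query_records`, the table name `t`, and the sample records are all hypothetical.

```python
import json
import sqlite3


def query_records(records, sql, table="t"):
    """Load flat dicts into an in-memory SQLite table, then run `sql` on it.

    Hypothetical helper for illustration only; not part of dsq.
    """
    # Use the union of all keys as the column set (missing keys become NULL).
    columns = sorted({key for rec in records for key in rec})
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    # The whole dataset is copied into SQLite before any query runs.
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([rec.get(c) for c in columns] for rec in records),
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


# Made-up NDJSON input, one JSON object per line.
ndjson = '{"name": "a", "size": 3}\n{"name": "b", "size": 5}\n{"name": "a", "size": 4}'
records = [json.loads(line) for line in ndjson.splitlines()]
result = query_records(
    records, "SELECT name, SUM(size) FROM t GROUP BY name ORDER BY name"
)
print(result)
```

The up-front copy is the cost mentioned above: every row is inserted before the first query can run, but afterwards any SQL that SQLite understands is available.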
OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can e.g. calculate a running group by + count while tailing a log file. It also supports normal databases, not just file formats, so you can e.g. join with a Postgres table.
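As a rough picture of what a running group by + count over a stream means, here is a small Python sketch. It only illustrates the idea and says nothing about how OctoSQL implements it; `running_counts` and the sample log lines are hypothetical.

```python
from collections import Counter


def running_counts(lines, key_of):
    """Yield an updated group-by count after every incoming line.

    `lines` can be any iterator, including one that keeps producing
    lines as a file grows. Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        counts[key_of(line)] += 1
        # Emit a snapshot instead of waiting for the input to end.
        yield dict(counts)


# Made-up log lines; the group key is the HTTP method.
log_lines = ["GET /a", "POST /b", "GET /c"]
snapshots = list(running_counts(log_lines, lambda line: line.split()[0]))
for snapshot in snapshots:
    print(snapshot)
```

The point is that a result is available after each row, so the query never has to see the end of its input.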
[0]: https://github.com/apache/arrow-datafusion
[1]: https://github.com/multiprocessio/dsq
[2]: https://github.com/cube2222/octosql
Disclaimer: Author of OctoSQL
-
It could be the NDJSON parser (DF source: [0]) or could be a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the definitive `columnq` binary from your comment, so it could also have something to do with compilation-time flags.
FWIW, we use the Parquet format with DataFusion and get very good speeds similar to DuckDB [2], e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC` on a 55M row subset of NY Taxi trip data.
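For readers less familiar with `date_trunc`, this is roughly what that query computes, written as a plain Python sketch over a few made-up rows. It is not DataFusion code and says nothing about its performance; `monthly_totals` and the sample trips are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(trips):
    """Group (pickup_datetime, total_amount) pairs by calendar month.

    Mirrors date_trunc('month', ...) with COUNT(*) and SUM(total_amount),
    ordered by month ascending. Hypothetical helper for illustration only.
    """
    totals = defaultdict(lambda: [0, 0.0])  # month -> [trip count, amount sum]
    for pickup, amount in trips:
        # Truncate the timestamp to the first instant of its month.
        month = pickup.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        totals[month][0] += 1
        totals[month][1] += amount
    return [(month, count, total) for month, (count, total) in sorted(totals.items())]


# Made-up sample rows standing in for the taxi trip data.
trips = [
    (datetime(2021, 1, 5, 8, 30), 10.0),
    (datetime(2021, 2, 1, 0, 0), 7.5),
    (datetime(2021, 1, 20, 23, 59), 2.5),
]
rows = monthly_totals(trips)
for row in rows:
    print(row)
```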
[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...
[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0