Hi, author of q here.
Regarding the error you got: q currently does not autodetect headers, so you'd need to add the -H flag in order to use the "country" column name. You're absolutely correct about failing fast here - it's a bug which I'll fix.
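For example, a run with headers enabled might look roughly like this (a sketch; the explicit -d , delimiter flag is my assumption, the comment above only mentions -H):

    q -d , -H "select country, count(*) from worldcitiespop_mil.csv group by country"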
In general, regarding speed: q supports automatic caching of the CSV files (through the "-C readwrite" flag). Once activated, it will write the data into another file (with a .qsql extension) and use it automatically in further queries to speed things up considerably.
Effectively, the .qsql files are regular sqlite3 files (with some metadata), and q can be used to query them directly (or any regular sqlite3 file), including the ability to seamlessly join between multiple sqlite3 files.
http://harelba.github.io/q/#auto-caching-examples
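A rough sketch of that workflow (the exact cache file name, and which flags are still needed when querying the cache directly, are assumptions on my part):

    # first run builds the .qsql cache next to the CSV; later runs reuse it automatically
    q -d , -H -C readwrite "select country, count(*) from worldcitiespop_mil.csv group by country"

    # the cache is a regular sqlite3 file, so it can also be queried directly
    q "select country, count(*) from worldcitiespop_mil.csv.qsql group by country"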
I have been using Dremio to query large volumes of CSV files: https://docs.dremio.com/software/data-sources/files-and-dire...
Having them in a columnar format is much better for fast responses, though.
GitHub: https://github.com/dremio/dremio-oss
"Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite."
https://github.com/dbohdan/sqawk
Reminds me of the textQL extension that's available in Asciidoc.
Point it to an external CSV file, enable TextQL, and bam, there's your query returned as a table. Handy for parts lists, inventory, that kind of crap.
https://github.com/dinedal/textql
https://gist.github.com/mojavelinux/8856117
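For a standalone taste of what that extension wraps, the textql CLI itself can be pointed at a CSV file (a sketch; I'm assuming the table name is derived from the file name, which is how I recall textql working):

    textql -header -sql "select country, count(*) from worldcitiespop_mil group by country" worldcitiespop_mil.csv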
When I want to quickly query a CSV file (usually log files), I like to use lnav, which also supports running SQL queries and handles pretty much any log file format I've had to deal with.
https://lnav.org/
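As a sketch of lnav's headless SQL mode (the access_log table and c_ip column come from lnav's built-in Apache access-log format; names are from memory, so treat them as assumptions):

    lnav -n -c ';select c_ip, count(*) from access_log group by c_ip' access.log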
Another similar tool for querying CSV with SQL is Steampipe (https://steampipe.io). Steampipe is much broader in concept, giving you the ability to query and join many cloud services with SQL as well: https://hub.steampipe.io/plugins
There is also an interesting dashboards-as-code concept where you can codify interactive dashboards with HCL + SQL: https://steampipe.io/blog/dashboards-as-code
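A minimal sketch of the cloud-service side (assuming the AWS plugin is installed and configured):

    steampipe query "select name, region from aws_s3_bucket"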
Nice work. I am a fan of tools like this and look forward to giving this a try.
However, in my first attempted query (version 3.1.6 on macOS), I ran into significant performance limitations and, more importantly, it did not give correct output.
In particular, running on a narrow table with 1 million rows (the same one used in the xsv examples), the command "select country, count(*) from worldcitiespop_mil.csv group by country" takes 12 seconds just to fail with the incorrect error 'no such column: country'.
Using sqlite3, it takes two seconds or so to load, less than a second to run, and gives me the correct result.
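The sqlite3 route looks roughly like this (a sketch, assuming an in-memory import into a table named cities):

    sqlite3 :memory: -cmd '.mode csv' -cmd '.import worldcitiespop_mil.csv cities' 'select country, count(*) from cities group by country'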
Using https://github.com/liquidaty/zsv (disclaimer, I'm one of its authors), I get the correct results in 0.95 seconds with the one-liner `zsv sql 'select country, count(*) from data group by country' worldcitiespop_mil.csv`.
I look forward to trying it again sometime soon.
Q runs SQLite under the hood. If you want real firepower, take a look at DuckDB - the codebase is also a work of art.
https://github.com/duckdb/duckdb
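As a quick sketch of the same kind of query in DuckDB, which can read a CSV file in place by name (auto-detecting the header and delimiter):

    duckdb -c "select country, count(*) from 'worldcitiespop_mil.csv' group by country"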
The large binary size argument is a bit old.
It seems possible to reduce it below 50MB.
https://github.com/ClickHouse/ClickHouse/issues/29378
I will give it a shot!
The web application is just a UI to get started. It acts like a database, providing you with JDBC and ODBC drivers and the Arrow Flight protocol: https://github.com/dremio-hub/arrow-flight-client-examples