Q – Run SQL Directly on CSV or TSV Files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • q

    q - Run SQL directly on delimited files and multi-file sqlite databases (by harelba)

  • Hi, author of q here.

    Regarding the error you got, q currently does not autodetect headers, so you'd need to add -H as a flag in order to use the "country" column name. You're absolutely correct on failing-fast here - It's a bug which i'll fix.

    In general regarding speed - q supports automatic caching of the CSV files (through the "-C readwrite" flag). Once it's activated, it will write the data into another file (with a .qsql extension), and will use it automatically in further queries in order to speed things considerably.

    Effectively, the .qsql files are regular sqlite3 files (with some metadata), and q can be used to query them directly (or any regular sqlite3 file), including the ability to seamlessly join between multiple sqlite3 files.

    http://harelba.github.io/q/#auto-caching-examples

  • dremio-oss

    Dremio - the missing link in modern data

  • I have been using Dremio to query large volume of CSV files: https://docs.dremio.com/software/data-sources/files-and-dire...

    Although having them in some columnar format is much better for fast responses.

    GitHub: https://github.com/dremio/dremio-oss

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • sqawk

    Like awk but with SQL and table joins

  • "Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite."

    https://github.com/dbohdan/sqawk

  • textql

    Execute SQL against structured text like CSV or TSV

  • Reminds me of the textQL extension that's available in Asciidoc.

    Point it to an external CSV file, enable TextQL, and bam, there's your query returned as a table. Handy for parts lists, inventory, that kind of crap.

    https://github.com/dinedal/textql

    https://gist.github.com/mojavelinux/8856117

    When I want to quickly query a csv file (usually log files), I like to use lnav which also supports running SQL queries and supports pretty much any log file format I happened to deal with.

    https://lnav.org/

  • steampipe

    Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.

  • Another similar tool for querying CSV with SQL is Steampipe (https://steampipe.io). Steampipe is much broader in concept giving you the ability to query and join many cloud services with SQL as well: https://hub.steampipe.io/plugins

    There is also an interesting dashboards as code concept where you can codify interactive dashboards with HCL + SQL: https://steampipe.io/blog/dashboards-as-code

  • zsv

    zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more

  • Nice work. I am a fan of tools like this and look forward to giving this a try.

    However, in my first attempted query (version 3.1.6 on MacOS), I ran into significant performance limitations and more importantly, it did not give correct output.

    In particular, running on a narrow table with 1mm rows (the same one used in the xsv examples) using the command "select country, count() from worldcitiespop_mil.csv group by country" takes 12 seconds just to get an incorrect error 'no such column: country'.

    using sqlite3, it takes two seconds or so to load, and less than a second to run, and gives me the correct result.

    Using https://github.com/liquidaty/zsv (disclaimer, I'm one of its authors), I get the correct results in 0.95 seconds with the one-liner `zsv sql 'select country, count() from data group by country' worldcitiespop_mil.csv`.

    I look forward to trying it again sometime soon

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • duckdb

    DuckDB is an in-process SQL OLAP Database Management System

  • Q runs SQLite under the hood. If you want real firepower, take a look at DuckDB, the codebase is also a work of art.

    https://github.com/duckdb/duckdb

  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

  • The large binary size argument is a bit old.

    It seems possible to reduce it below 50MB.

    https://github.com/ClickHouse/ClickHouse/issues/29378

  • I will give it a shot!

    The web application is just a UI to get started. It acts like a database providing you with jdbc and odbc drivers and arrow flight protocol: https://github.com/dremio-hub/arrow-flight-client-examples

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts