OctoSQL allows you to join data from different sources using SQL

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • octosql

    OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.

  • Hey!

    > I think the main fundamental difference is that this wants all of the data upfront in a data file.

    Absolutely not! Moreover, OctoSQL can push down predicates to databases so that it only has to download a small subset of the table, if the datasource and query allow it.

    > Very easy to model HTTP APIs as a table.

    "Very easy" is relative, but you can take a look at the random_data[0] datasource, which is exactly this. I'm also planning to add a GitHub datasource fairly soon. That said, there is Steampipe[1], for which this is the main use case afaik (hitting APIs and exposing them as tables through Postgres FDWs written in Go), so it might be a smoother and more polished experience. There are also tons of plugins already available for it.

    > Easy to model basically anything as a table for example files on my filesystem.

    Yep, definitely. That's the idea behind OctoSQL: to create a tool for easily exposing anything through SQL (like your machine's process list or an API) and joining that with a file or database. There's still lots of documentation work left to do, though, in order to make the plugin authoring experience easier.

    > A decent query planner so that I can avoid expensive things (like API calls) if I can determine if I need the object based on something cheaper (like a local disk access).

    Probably depends on the use case, and it sometimes needs you to be fairly explicit, but OctoSQL does in fact do that. It will push down predicates to underlying databases, which means joining something small with something very big (while only taking very small amounts of the latter) can be very fast with LOOKUP JOINs.

    > I want something that is easy to extend to sources that are possibly non-listable or at the very least I don't want to have all of the data available.

    Doable. An example of this is the `plugins.available_versions` table[2]. It requires you to provide the plugin name as a predicate, as the versions need to be downloaded from the plugin's own repository (and listing all plugin repositories on each query isn't really what you want to be doing). You can also LOOKUP JOIN with the `plugins.available_plugins` table if that is indeed what you want.

    [0]: https://github.com/cube2222/octosql-plugin-random_data

    [1]: https://steampipe.io

    [2]: https://github.com/cube2222/octosql/blob/main/datasources/pl...
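
The ideas in the comment above (a table that must be given a predicate, and a lookup join that fetches only the matching rows from the larger side) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not OctoSQL's implementation or plugin API: `VersionsSource`, `scan`, `lookup_join`, and the sample data are all hypothetical names invented for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional


class VersionsSource:
    """Hypothetical stand-in for a table that cannot be listed in full."""

    def __init__(self, versions_by_plugin: Dict[str, List[str]]) -> None:
        self._data = versions_by_plugin
        self.fetches = 0  # counts simulated remote calls

    def scan(self, plugin_name: Optional[str] = None) -> List[dict]:
        # The predicate is mandatory: without it we would have to
        # contact every plugin repository on each query.
        if plugin_name is None:
            raise ValueError("this table requires a plugin_name predicate")
        self.fetches += 1  # one simulated fetch per pushed-down predicate
        return [
            {"plugin_name": plugin_name, "version": version}
            for version in self._data.get(plugin_name, [])
        ]


def lookup_join(left_rows: Iterable[dict], right: VersionsSource, key: str) -> Iterator[dict]:
    """For every left row, ask the right side only for matching rows."""
    for left in left_rows:
        for match in right.scan(left[key]):
            yield {**left, **match}


versions = VersionsSource({
    "postgres": ["0.1.0", "0.2.0"],
    "random_data": ["0.1.0"],
    "mysql": ["0.3.0"],
})
installed = [{"name": "postgres"}, {"name": "random_data"}]

rows = list(lookup_join(installed, versions, "name"))
for row in rows:
    print(row)
print("remote fetches:", versions.fetches)
```

The small side drives the join, so the number of simulated fetches grows with the number of left rows rather than with the size of the right-hand source, and the `mysql` entry is never touched.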

  • steampipe

    Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.

  • materialize

    The data warehouse for operational workloads. (by MaterializeInc)

  • Thanks!

    Definitely considering adding a server-mode with Postgres wire protocol compatibility.

    It's tricky for the more dynamic/dataflow'y parts, as OctoSQL is able to give you a live-updating output table (which the Postgres wire protocol doesn't support), but I can go with an approach similar to what Materialize[0] does for those use cases: creating a live-updating materialized view that you can query from.

    That said, for now I'm still concentrating on the overall local usage experience and ergonomics; there's still much to improve there.

    [0]: https://materialize.com
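
As a rough illustration of what a "live-updating materialized view that you can query from" means, the sketch below keeps the result of a grouped count up to date on every change, so reading it never rescans the input. It is a toy model only: `CountByKeyView` and its methods are hypothetical names, and neither OctoSQL nor Materialize is claimed to work this way internally.

```python
from collections import defaultdict
from typing import Dict


class CountByKeyView:
    """Toy view equivalent to: SELECT key, count(*) FROM t GROUP BY key."""

    def __init__(self, key: str) -> None:
        self._key = key
        self._counts: Dict[str, int] = defaultdict(int)

    def on_insert(self, row: dict) -> None:
        # Work happens when data arrives, not when the view is read.
        self._counts[row[self._key]] += 1

    def on_delete(self, row: dict) -> None:
        value = row[self._key]
        self._counts[value] -= 1
        if self._counts[value] == 0:
            del self._counts[value]  # drop empty groups from the result

    def query(self) -> Dict[str, int]:
        # Reading is a cheap snapshot of the maintained state.
        return dict(self._counts)


view = CountByKeyView("status")
view.on_insert({"id": 1, "status": "open"})
view.on_insert({"id": 2, "status": "open"})
view.on_insert({"id": 3, "status": "closed"})
snapshot_before = view.query()

view.on_delete({"id": 3, "status": "closed"})
snapshot_after = view.query()
print(snapshot_before, snapshot_after)
```

A client speaking a request/response protocol can then poll `query()` whenever it likes, which is the general shape of the workaround described in the comment.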

  • Thanks!

    I'll start by saying that OctoSQL is single-machine-only, as I'm not sure what exactly you meant with "federated".

    I'd recommend starting by going with a debugger through the execution of the root function in root.go, as that calls out to all the macro transformations.

    Then, you can take a look at the optimizer (optimizer directory) and the Postgres plugin source[0], as an example of a plugin that is able to push down predicates to the underlying database, as well as the Typecheck (logical -> physical) and Materialize (physical -> execution) transformations.

    I'm planning to write a few technical documents about the implementation soon, while writing some actual usage documentation as well.

    [0]: https://github.com/cube2222/octosql-plugin-postgres

  • dsq

    Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

  • OctoSQL is an awesome project, and Kuba has a lot of great experience to share from building it, which I'm excited to learn from.

    And while building a custom database engine does allow you to do pretty quick queries, there are a few issues.

    First, the SQL implemented is nonstandard. As I was looking for documentation, it pointed me to `SELECT * FROM docs.functions fs`. I tried to count the number of functions, but octosql crashed (a Go panic) when I ran `SELECT count(1) FROM docs.functions fs` and `SELECT count() FROM docs.functions fs`, which is what I lazily do in standard SQL databases. (`SELECT count(fs.name) FROM docs.functions fs` worked.)

    This kind of thing will keep happening because this project just doesn't have as many resources today as SQLite, Postgres, DuckDB, etc. It will support a limited subset of SQL.

    Second, the standard library seems pretty small. When I counted the builtin functions, there were only 29. This is an easy thing to rectify over time; I'm just noting the state today.

    And third, this project only has builtin support for querying CSV and JSON files. Again, this could be easy to rectify over time; I'm just mentioning the state today.

    octosql is a great project but there are also different ways to do the same thing.

    I built dsq [0], which runs all queries through SQLite, so it avoids point 1. It has access to SQLite's standard builtin functions plus a battery of extra statistical aggregation, string manipulation, URL manipulation, date manipulation, hashing, and math functions custom built to help with the kind of interactive querying developers commonly do [1].

    And dsq supports not just CSV and JSON but also Parquet, Excel, ODS, ORC, YAML, TSV, and Apache and nginx logs.

    A downside to dsq is that it is slower for large files (say over 10GB) when you only want a few columns whereas octosql does better in some of those cases. I'm hoping to improve this over time by adding a SQL filtering frontend to dsq but in all cases dsq will ultimately use SQLite as the query engine.

    You can find more info about similar projects in octosql's Benchmark section, but I also have a comparison section in dsq [2] and an extension of the octosql benchmark with a different set of tools [3], including duckdb.

    Everyone should check out duckdb. :)

    [0] https://github.com/multiprocessio/dsq

    [1] https://github.com/multiprocessio/go-sqlite3-stdlib

    [2] https://github.com/multiprocessio/dsq#comparisons

    [3] https://github.com/multiprocessio/dsq#benchmark
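
For reference, here is how the counting forms mentioned in the comment above behave in SQLite, the engine dsq runs its queries through. This is a minimal sketch using Python's built-in `sqlite3` module directly, not dsq or OctoSQL; the `functions` table and its contents are made up for the example.

```python
import sqlite3

# In-memory database with a made-up table standing in for docs.functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (name TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?)",
    [("len",), ("upper",), ("lower",), (None,)],  # one row has a NULL name
)


def count_with(expr: str) -> int:
    """Run SELECT count(<expr>) against the table and return the number."""
    cursor = conn.execute(f"SELECT count({expr}) FROM functions fs")
    return cursor.fetchone()[0]


counts = {expr: count_with(expr) for expr in ("1", "*", "fs.name")}
print(counts)
```

Note that `count(fs.name)` skips rows where the column is NULL, so it is only a drop-in replacement for `count(1)` or `count(*)` when the column has no NULL values.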

  • go-sqlite3-stdlib

    A standard library for mattn/go-sqlite3 including best-effort date parsing, url parsing, math/string functions, and stats aggregation functions

  • octosql-plugin-random_data

    OctoSQL plugin serving random data

  • cargo-semver-checks

    Scan your Rust crate for semver violations.

  • [2]: https://github.com/obi1kenobi/cargo-semver-check/blob/main/s... -- the query is in GraphQL syntax, you can copy-paste it into an editor to get syntax highlighting

  • noria

    Fast web applications through dynamic, partially-stateful dataflow

  • Materialize is really neat; also check out https://github.com/mit-pdos/noria. It inverts the query problem and processes the data on insert, which is exactly what most applications end up doing when they use a NoSQL solution.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
