Pure Python Distributed SQL Engine

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • datafusion-python

    Apache Arrow DataFusion Python Bindings

  • Interesting, I was wondering if you considered building on top of https://github.com/apache/arrow-datafusion-python

    I really do think a distributed db with compute/storage separation and optimized for feature engineering/dataloading (for training NNs) is underserved.

    I'd be very interested in the time series aspects of what you're building.

  • quokka

    Making data lake work for time series (by marsupialtail)

  • arrow-ballista

    Apache Arrow Ballista Distributed Query Engine

  • Can you explain how this might differ from something like https://github.com/apache/arrow-ballista ?

    I've seen several variants of "next-gen" Spark, but I have yet to see a real comparison of the tradeoffs, advantages, and disadvantages between them.

  • sqlglot

    Python SQL Parser and Transpiler

  • pg8000

    A Pure-Python PostgreSQL Driver

  • When people say "pure X", to me, it normally means they didn't involve an FFI or external compiler. This is an often beneficial thing since it simplifies your build process.

    For example, here [0] is a "pure Python postgres driver" and the implication is that it doesn't use libpq.

    Or see also this discussion [1].

    [0] https://github.com/tlocke/pg8000

    [1] https://www.reddit.com/r/learnpython/comments/nktut1/eli5_th...

  • polars

    Dataframes powered by a multithreaded, vectorized query engine, written in Rust

  • Yes, we have basic support.

    Here are some examples of how to use it in Python:

    https://github.com/pola-rs/polars/blob/91a419acaf024e64410e7...

    However, full SQL support is on the roadmap. It's just a matter of hours in a day...

  • opteryx

    🦖 A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.

  • Thanks for sharing.

    I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database. That probably reflects the problem I had in front of me when I set out: handling only a handful of gigabytes in a batch environment for ETLs, with a group of engineers who were new to data engineering. I have recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.

  • sqlparser-rs

    Extensible SQL Lexer and Parser for Rust

  • It uses https://github.com/sqlparser-rs/sqlparser-rs as the parser and lexer. The binder, planner, optimizer and executor are in Python. The optimizer stage only works on the logical plan and the rules are heuristic only.
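
To make the last comment concrete, here is a minimal sketch of what a heuristic-only optimizer over a logical plan can look like in pure Python. It is not the project's actual code: the node types (`Scan`, `Project`, `Filter`), the single pushdown rule, and the tiny in-memory executor are all hypothetical names invented for illustration, and a real engine would carry many more rules and a separate physical plan.

```python
from dataclasses import dataclass, replace


# Hypothetical logical-plan nodes; a real planner would have many more.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object


@dataclass(frozen=True)
class Filter:
    column: str
    value: object
    child: object


def push_filter_below_project(plan):
    """Heuristic rule: rewrite Filter(Project(x)) as Project(Filter(x)).

    The rewrite is applied whenever it is legal; no cost model is consulted.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        if plan.column in project.columns:
            pushed = Filter(plan.column, plan.value, project.child)
            return Project(project.columns, pushed)
    return plan


RULES = [push_filter_below_project]


def rewrite(plan):
    """Apply every rule at this node, then recurse into the child."""
    for rule in RULES:
        plan = rule(plan)
    if isinstance(plan, Scan):
        return plan
    return replace(plan, child=rewrite(plan.child))


def optimize(plan):
    """Repeat top-down rewriting until the plan stops changing."""
    while True:
        rewritten = rewrite(plan)
        if rewritten == plan:
            return plan
        plan = rewritten


def execute(plan, tables):
    """Naive executor over in-memory tables (lists of dict rows)."""
    if isinstance(plan, Scan):
        return list(tables[plan.table])
    rows = execute(plan.child, tables)
    if isinstance(plan, Filter):
        return [row for row in rows if row[plan.column] == plan.value]
    return [{col: row[col] for col in plan.columns} for row in rows]


# Usage: the filter starts above the projection and ends up next to the scan.
tables = {"t": [{"a": 1, "b": 0, "c": "x"}, {"a": 2, "b": 0, "c": "y"}]}
plan = Filter("c", "x", Project(("a", "c"), Scan("t")))
optimized = optimize(plan)
```

Because the rule only reorders operators, `execute(plan, tables)` and `execute(optimized, tables)` return the same rows. The optimized plan simply filters before it projects, which is the kind of logical-plan rewrite a heuristic optimizer performs without looking at statistics.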

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
