Interesting. I was wondering if you considered building on top of https://github.com/apache/arrow-datafusion-python
I really do think a distributed db with compute/storage separation and optimized for feature engineering/dataloading (for training NNs) is underserved.
I'd be very interested in the time series aspects of what you're building.
Can you explain how this might differ from something like https://github.com/apache/arrow-ballista
I've seen several variants of "next-gen" spark, but nowhere have I really seen the different tradeoffs/advantages/disadvantages between them.
When people say "pure X", to me, it normally means they didn't involve an FFI or external compiler. This is often beneficial, since it simplifies your build process.
For example, here [0] is a "pure Python postgres driver", and the implication is that it doesn't use libpq.
Or see also this discussion [1].
[0] https://github.com/tlocke/pg8000
[1] https://www.reddit.com/r/learnpython/comments/nktut1/eli5_th...
Yes, we have basic support.
Here are some examples of how to use it in python:
https://github.com/pola-rs/polars/blob/91a419acaf024e64410e7...
However, full sql support is on the roadmap. It's just a matter of hours in a day...
Thanks for sharing.
I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database. That probably reflects the problem I had in front of me when I set out: handling handfuls of gigabytes in a batch environment for ETLs, with a group of new-to-data-engineering engineers. I've recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.
It uses https://github.com/sqlparser-rs/sqlparser-rs as the parser and lexer. The binder, planner, optimizer and executor are in Python. The optimizer stage only works on the logical plan and the rules are heuristic only.
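To illustrate what a heuristic rule over a logical plan can look like, here is a hypothetical sketch (not Opteryx's actual code) of one classic rewrite, pushing a selection below a projection so filtering happens earlier. The node classes and rule are invented for this example, and it assumes the predicate only references projected columns:

```python
from dataclasses import dataclass


@dataclass
class Scan:
    table: str


@dataclass
class Projection:
    columns: list
    child: object


@dataclass
class Selection:
    predicate: str
    child: object


def optimize(node):
    """Apply one heuristic rule bottom-up over the logical plan:
    Selection(Projection(x)) -> Projection(Selection(x))."""
    if isinstance(node, (Projection, Selection)):
        node.child = optimize(node.child)
    if isinstance(node, Selection) and isinstance(node.child, Projection):
        proj = node.child
        return Projection(proj.columns, Selection(node.predicate, proj.child))
    return node


# Original plan: filter applied after projecting columns.
plan = Selection("price > 10", Projection(["name", "price"], Scan("products")))
optimized = optimize(plan)
# After the rewrite, the Selection sits below the Projection,
# so rows are filtered before columns are projected.
```

A real optimizer would also check that the predicate's columns survive the projection before firing the rule; that guard is omitted here for brevity.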