opteryx
quokka
| opteryx | quokka
---|---|---
Mentions | 1 | 23
Stars | 43 | 1,082
Growth | - | -
Activity | 9.8 | 8.3
Last commit | 6 days ago | 7 months ago
Language | Python | Python
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
opteryx
-
Pure Python Distributed SQL Engine
Thanks for sharing.
I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database. That probably reflects the problem I had in front of me when I set out: handling only handfuls of gigabytes in a batch environment for ETLs, with a group of new-to-data-engineering engineers. I have recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.
quokka
-
How Query Engines Work
An awesome read!
Something related that I found out about from HN a few months back is another engine called quokka. It's particularly interesting and applicable how quokka schedules distributed queries to outperform Spark https://github.com/marsupialtail/quokka/blob/master/blog/why...
- Quokka – Distributed Polars on Ray
-
Algorithmic Trading with Go
Hi Justin, you might be interested in my blog post advocating a cloud-based approach: https://github.com/marsupialtail/quokka/blob/master/blog/bac...
You don't have to use the system I am building, but it's worth thinking about that design.
-
Daft: A High-Performance Distributed Dataframe Library for Multimodal Data
SQL support is very challenging.
I work on Quokka (https://github.com/marsupialtail/quokka). I support Iceberg reads. Recently we have been adding SQL support simply by parsing the DuckDB logical plan, though that is very challenging as well.
The Python world lacks a standard for a plug and play SQL query optimizer. Apache Calcite is good for the JVM world, but not great if you are trying to cut out the JVM.
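To make the idea of reusing another engine's plan concrete, here is a minimal sketch of reading a query plan as structured rows. It is an illustration only: it uses the standard-library `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` as a stand-in for DuckDB so that it runs without extra dependencies, and it is not how Quokka consumes DuckDB plans. The `plan_tree` helper and the table names are hypothetical.

```python
import sqlite3


def plan_tree(conn, sql):
    """Return the plan for `sql` as a list of (depth, detail) pairs.

    Hypothetical helper. SQLite stands in for DuckDB here; each plan row
    is (id, parent, notused, detail), and parent 0 means top level.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    depth = {0: -1}  # virtual root, so top-level nodes get depth 0
    out = []
    for node_id, parent, _unused, detail in rows:
        depth[node_id] = depth.get(parent, -1) + 1
        out.append((depth[node_id], detail))
    return out


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.execute("CREATE TABLE symbols (symbol TEXT PRIMARY KEY, name TEXT)")

plan = plan_tree(
    conn,
    "SELECT s.name, t.price FROM trades t JOIN symbols s ON s.symbol = t.symbol",
)
for level, detail in plan:
    print("  " * level + detail)
```

A front end built this way still has to translate every plan node into its own operators, which is presumably where much of the difficulty mentioned above comes from.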
- Why your dataframe library needs to understand vector embeddings
-
The Inner Workings of Distributed Databases
In case people are interested, I wrote a post about fault tolerance strategies of data systems like Spark and Flink: https://github.com/marsupialtail/quokka/blob/master/blog/fau...
The key difference here is that these systems don't store data, so fault tolerance means recovering within a query instead of not losing data.
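As a rough illustration of "recovering within a query", the sketch below re-runs a failed task on its input partition instead of restoring stored state. It is a toy model under stated assumptions, not a description of how Spark, Flink, or Quokka implement recovery; `run_query`, `flaky_sum`, and the simulated failure are all hypothetical.

```python
def run_query(partitions, task, max_attempts=3):
    """Apply `task` to each partition, re-running it if a worker fails.

    Hypothetical sketch: recovery means recomputing the failed partition
    from its input, since the engine itself stores no data.
    """
    results = []
    for part in partitions:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(task(part))
                break  # this partition succeeded, move on
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up: the whole query fails


    return results


failed_once = set()


def flaky_sum(part):
    """Sum a partition, failing once on partition (3, 4) to simulate a crash."""
    key = tuple(part)
    if key == (3, 4) and key not in failed_once:
        failed_once.add(key)
        raise RuntimeError("simulated worker failure")
    return sum(part)


total = sum(run_query([[1, 2], [3, 4], [5, 6]], flaky_sum))
print(total)  # 21
```

The query still returns the right answer after one task fails, because only the affected partition is recomputed.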
-
Launch HN: DAGWorks – ML platform for data science teams
would love to collaborate on an integration with pyquokka (https://github.com/marsupialtail/quokka) once I put out a stable release end of this month :-)
-
Is Spark always your go-to solution?
Then you should keep an eye on quokka. This may become the "Spark" for Polars/DuckDB. It seems to be under active development though I'm not sure how stable it is.
- Distributed fault tolerance made simple
- Fault tolerance for distributed data systems is quite simple
What are some alternatives?
nomad - Deprecated and re-branded as Alto
cempaka - "Write a trading bot which buys low and sells high." Sounds simple enough, right?
influxdb3-python - Python module that provides a simple and convenient way to interact with InfluxDB 3.0.
awesome-pipeline - A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin
pg8000 - A Pure-Python PostgreSQL Driver
spyql - Query data on the command line with SQL-like SELECTs powered by Python expressions
datafusion-ballista - Apache Arrow Ballista Distributed Query Engine
emr-serverless-samples - Example code for running Spark and Hive jobs on EMR Serverless.
blog - Some notes on things I find interesting and important.
datafusion-python - Apache DataFusion Python Bindings
sqlglot - Python SQL Parser and Transpiler