opteryx
emr-serverless-samples
| | opteryx | emr-serverless-samples |
|---|---|---|
| Mentions | 1 | 4 |
| Stars | 43 | 140 |
| Growth | - | 3.6% |
| Activity | 9.8 | 6.5 |
| Last commit | 6 days ago | about 2 months ago |
| Language | Python | Python |
| License | Apache License 2.0 | MIT No Attribution |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
opteryx
- Pure Python Distributed SQL Engine
Thanks for sharing.
I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database. That probably reflects the problem I had in front of me when I set out: handling only a handful of gigabytes in a batch environment for ETLs, with a group of engineers new to data engineering. I have recently started looking more at real-time performance, such as distributing work, and am interested in how you've approached it.
emr-serverless-samples
- Hi, I want to convert an existing EMR on EC2 cluster to EMR Serverless.
In addition to the Serverless docs, be sure to check out the emr-serverless-samples GitHub repo.
- Should I study these topics or should I skip?
- emr-serverless-samples: Example code for running Spark and Hive jobs on EMR Serverless.
- Running Delta Lake on Amazon EMR Serverless
To use Java dependencies, we have to build them manually into a single .jar file. AWS has provided a Dockerfile that we can use to build the dependencies without having to install Maven locally (😍). I used this pom.xml file to define the dependencies:
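The pom.xml from that post is not reproduced on this page. As a rough sketch of the step that would follow, assuming the single .jar has already been built and uploaded to S3, a Spark job on EMR Serverless could point at it through the `spark.jars` setting. The function names, application ID, role ARN, and S3 paths below are hypothetical placeholders; only the request shape follows boto3's `start_job_run` for the `emr-serverless` client.

```python
def build_job_request(application_id, execution_role_arn, entry_point,
                      uber_jar_uri, name="delta-lake-job"):
    """Build keyword arguments for the EMR Serverless StartJobRun call.

    All argument values are supplied by the caller; nothing here is taken
    from the original post.
    """
    # Make the bundled Java dependencies available to the Spark job.
    spark_params = f"--conf spark.jars={uber_jar_uri}"
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "name": name,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": spark_params,
            }
        },
    }


def submit(request):
    """Submit the job; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    return client.start_job_run(**request)["jobRunId"]


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    request = build_job_request(
        application_id="app-123",
        execution_role_arn="arn:aws:iam::111122223333:role/example-role",
        entry_point="s3://example-bucket/code/main.py",
        uber_jar_uri="s3://example-bucket/artifacts/uber.jar",
    )
    print(request["jobDriver"]["sparkSubmit"]["sparkSubmitParameters"])
```

Keeping the request builder separate from the AWS call makes the configuration easy to inspect before anything is submitted.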
What are some alternatives?
quokka - Making data lake work for time series
cube.js - 📊 Cube — The Semantic Layer for Building Data Applications
nomad - Deprecated and re-branded as Alto
AWS Data Wrangler - pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
influxdb3-python - Python module that provides a simple and convenient way to interact with InfluxDB 3.0.
Redash - Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
pg8000 - A Pure-Python PostgreSQL Driver
data-science-ipython-notebooks - Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
datafusion-ballista - Apache Arrow Ballista Distributed Query Engine
maven-mvnd - Apache Maven Daemon
datafusion-python - Apache DataFusion Python Bindings
sqlparser-rs - Extensible SQL Lexer and Parser for Rust