mack
petastorm
Metric | mack | petastorm |
---|---|---|
Mentions | 5 | 2 |
Stars | 271 | 1,752 |
Growth | - | 0.7% |
Activity | 5.9 | 3.7 |
Last commit | 3 months ago | 5 months ago |
Language | Python | Python |
License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
mack
-
Implementing and using SCD Type 2
There is also a library from Databricks, though I have never used it: https://github.com/MrPowers/mack
-
Spark/databricks seems amazing?
I was a Databricks user for 5 years and spent almost all my time inside the IntelliJ IDE developing code. I wrote almost all code in a text editor, unit tested all code (I actually authored the popular Scala Spark / PySpark testing libraries: https://github.com/MrPowers/) and had everything set up with CI/CD. I did lots of OSS PySpark/Scala Spark work too. I only used Databricks notebooks for data exploration and for lightweight notebooks that would invoke functions (that were defined in Python Wheel / JAR files). I am on the Delta Lake team at Databricks now and still do all my work in text editors (see this project: https://github.com/MrPowers/mack) and create lots of examples in Jupyter Notebooks. So I definitely think it's possible to limit notebook exposure.
-
PySpark OSS Contribution Opportunity
Great, would love your help. You can also check out the mack project if you'd like to work on a Delta Lake + PySpark project: https://github.com/MrPowers/mack/issues
-
Spark open source community is awesome
A couple of devs just added a `find_composite_key_candidates` function so users can easily identify columns that could be used as a unique identifier in their Delta table.
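To make the idea of a composite key candidate concrete, here is a minimal pure-Python sketch. It is not mack's implementation and does not touch Delta Lake or Spark: the helper name `find_key_candidates`, the `max_cols` parameter, and the sample rows are all hypothetical. It just searches for the smallest column combinations whose values are unique across every row.

```python
from itertools import combinations


def find_key_candidates(rows, max_cols=2):
    """Return the smallest column combinations whose values are unique per row.

    Hypothetical helper for illustration only; `rows` is a list of dicts
    that all share the same keys.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    total = len(rows)
    # Try single columns first, then pairs, and so on up to max_cols.
    for size in range(1, max_cols + 1):
        candidates = [
            list(combo)
            for combo in combinations(columns, size)
            # A combination is a key candidate if no two rows share its values.
            if len({tuple(row[c] for c in combo) for row in rows}) == total
        ]
        if candidates:
            return candidates
    return []


# Made-up sample data: no single column is unique on its own.
rows = [
    {"first_name": "ana", "last_name": "silva", "country": "BR"},
    {"first_name": "ana", "last_name": "gomez", "country": "BR"},
    {"first_name": "li", "last_name": "gomez", "country": "CN"},
]
candidates = find_key_candidates(rows)
print(candidates)
```

On a real Delta table the same check would presumably run as distinct counts in Spark rather than over in-memory dicts, so treat this only as an illustration of the concept.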
-
How to append data to Delta tables without adding any duplicates
Fair points. Here's the code repo: https://github.com/MrPowers/mack
petastorm
What are some alternatives?
chispa - PySpark test helper methods with beautiful error messages
horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
delta-rs - A native Rust library for Delta Lake, with bindings into Python
Activeloop Hub - Data Lake for Deep Learning. Build, manage, query, version, & visualize datasets. Stream data real-time to PyTorch/TensorFlow. https://activeloop.ai [Moved to: https://github.com/activeloopai/deeplake]
os-lib - OS-Lib is a simple, flexible, high-performance Scala interface to common OS filesystem and subprocess APIs
d2l-en - Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
jodie - Delta lake and filesystem helper methods
best-of-ml-python - 🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.
nni - An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
jina - ☁️ Build multimodal AI applications with cloud-native stack
wandb - 🔥 A tool for visualizing and tracking your machine learning experiments. This repo contains the CLI and Python API.
SysML-v2-Release - The latest incremental release of SysML v2. Start here.