Top 5 Python data-pipeline Projects
-
If you want to schedule your ETL, you can do something basic with Windows Task Scheduler or use something fancier, like a Python orchestration library such as dagster. Dagster works on Windows, which probably makes it your best bet, as most (if not all) other orchestration libraries with a scheduler don't work on Windows.
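As a rough illustration of the "basic" route, here is a minimal sketch of an ETL script with a single entry point that a scheduler such as Windows Task Scheduler could invoke. It uses only the Python standard library; the function names, the `sales` table, and the sample CSV are hypothetical and not taken from dagster or from the original post.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Parse CSV text into a list of dict rows (hypothetical input format)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Keep rows with a positive amount and cast the amount to float."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount > 0:
            cleaned.append((row["name"].strip(), amount))
    return cleaned


def load(records, conn):
    """Insert the records into a hypothetical 'sales' table; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()
    return len(records)


def run_etl(csv_text, conn):
    """Single entry point: extract, transform, then load."""
    return load(transform(extract(csv_text)), conn)


if __name__ == "__main__":
    # Sample data stands in for a real source file or API response.
    sample = "name,amount\nalice,10.5\nbob,-3\ncarol,7\n"
    conn = sqlite3.connect(":memory:")
    print(run_etl(sample, conn), "rows loaded")
    conn.close()
```

Keeping the whole run behind one function means the same `run_etl` could later be called from an orchestration library instead of a scheduled script, if the pipeline outgrows the basic setup.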
-
Activeloop Hub
Dataset format for AI. Build, manage, query & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai (by activeloopai)
Hey u/platoTheSloth, as u/gopietz mentioned (thanks a lot for the shout-out!!!), you can share them with the general public by uploading to Activeloop Platform (for researchers, we offer special terms, but even as a member of the general public you get up to 300 GB of free storage!). Thanks to our open source dataset format for AI, Hub, anyone can load the dataset in under 3 seconds with one line of code, and stream it while training in PyTorch/TensorFlow.
-
+1 on a lightweight version of GE to more easily make part of an existing pipeline. Would like it for internal use (our data pipelines), but also for our open source users (https://github.com/orchest/orchest).
-
Project mention: Launch HN: Elementary (YC W22) – Open-source data observability | news.ycombinator.com | 2022-03-04
For any dbt users, their reliability package has the best and most comprehensive way to upload artifacts directly to the warehouse after a dbt invocation.
Python data-pipelines related posts
- [Q] where to host 50GB dataset (for free?)
- ETL advice appreciated
- Workflow automation for smaller use-cases
- [N] [P] Access 100+ image, video & audio datasets in seconds with one line of code & stream them while training ML models with Activeloop Hub (more at docs.activeloop.ai, description & links in the comments below)
- Thinking of making a switch from actuarial science to data engineering
- Easy way to load, create, version, query and visualize computer vision datasets
- Easy way to load, create, version, query & visualize machine learning datasets
Index
What are some of the best open-source data-pipeline projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | dagster | 4,908 |
2 | Activeloop Hub | 4,633 |
3 | orchest | 3,042 |
4 | patterns-devkit | 74 |
5 | dbt-data-reliability | 36 |