Python MLOps

Open-source Python projects categorized as MLOps

Top 23 Python MLOps Projects

  • GitHub repo label-studio

    Label Studio is a multi-type data labeling and annotation tool with standardized output format

    Project mention: [D] Portals for outsourcing preliminary data labeling | reddit.com/r/MachineLearning | 2022-01-13

    Not exactly for this solution, but I have really liked this tool: https://labelstud.io/ It is open source and can be self-hosted if needed.

  • GitHub repo great_expectations

    Always know what to expect from your data.

    Project mention: [P] Deepchecks: an open-source tool for high standards validations for ML models and data. | reddit.com/r/MachineLearning | 2022-01-06

  • GitHub repo metaflow

    :rocket: Build and manage real-life data science projects with ease!

    Project mention: Best job scheduler in 2022? (Airflow / Dagster / Prefect / Luigi / other) | reddit.com/r/dataengineering | 2022-01-18

    Can I give a plug for Metaflow? It's particularly well suited to data science and ML workflows, with great tooling that's basically just annotations on Python functions that give you:

    - DAG orchestration
    - parallelism
    - cloud integration
    - data flow through DAGs

    Very useful, imo, for data science teams trying to migrate their existing scripts to (and write new ones on) Metaflow.
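
    The idea of "annotations on Python functions" defining a DAG can be illustrated with a toy sketch. This is not Metaflow's API: the `step` decorator, the two registry dictionaries, and `run_all` are hypothetical names made up for illustration, and only the standard-library `graphlib` module (Python 3.9+) is assumed.

```python
from graphlib import TopologicalSorter

# Hypothetical mini-registry for illustration only; NOT Metaflow's API.
_steps = {}  # step name -> function
_deps = {}   # step name -> set of upstream step names


def step(*after):
    """Register a function as a DAG node that runs after the named steps."""
    def register(func):
        _steps[func.__name__] = func
        _deps[func.__name__] = set(after)
        return func
    return register


@step()
def load(results):
    return [1, 2, 3]


@step("load")
def double(results):
    return [x * 2 for x in results["load"]]


@step("load")
def square(results):
    return [x * x for x in results["load"]]


@step("double", "square")
def join(results):
    return sum(results["double"]) + sum(results["square"])


def run_all():
    """Run every registered step in dependency order and collect the outputs."""
    results = {}
    for name in TopologicalSorter(_deps).static_order():
        results[name] = _steps[name](results)
    return results
```

    Calling `run_all()` executes `load` first, then the two independent branches, then `join`. A real orchestrator would add what this sketch leaves out, such as running independent branches in parallel and persisting the data passed between steps.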

  • GitHub repo Kedro

    A Python framework for creating reproducible, maintainable and modular data science code.

    Project mention: [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill? | reddit.com/r/MachineLearning | 2021-11-03

    I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise, you would be adding hassle with some (bluntly speaking) opinionated and inflexible boilerplate code which not many people will like using.

    You mention some issues: code that is non-obvious to understand and hard to execute and replicate. Bad code which does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first. However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code.

    Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. This will mean that, first, they have to exactly specify all libraries that need to be installed. Second, they need to externalize all configuration, in particular data input and data output paths. Not a single value should be hard-coded in code! And finally, they need a *single* command which can be run to execute the whole(!) pipeline. If they fail on any of these parts... they should try again. Work that does not pass this test is considered unfinished by the author. Basically you are introducing an automated, infallible test.

    Regarding your code, I'd really not try that direction. In particular, even these few lines already look unclear and over-engineered. The csv format is already hard-coded into the code. If it changes to parquet you'd have to touch the code. The processing object has fixed data paths, for which there is no reason in a job that should take care of pure processing. Exporting data is also not something that a processing job should handle. And what if you have multiple input and output data? You would not have all these issues if you had kept to the simplest solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out. It would also mean zero additional libraries or boilerplate. I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may engage in.

    There are two small features which are useful beyond a plain function though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already includes experience from a lot of developers. No need to rewrite that. I'm not sure which one is best (metaflow, luigi, airflow, ... https://github.com/pditommaso/awesome-pipeline no idea), but many come with a lot of features. If you want a bit more scaffolding to more easily understand foreign projects, you could look at https://github.com/quantumblacklabs/kedro but maybe that's already too much. Fix the "single command replication-from-scratch requirement" first.
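
    The advice in this comment (a pure `process(...)` function, externalized input and output paths, and one entry point for the whole pipeline) might look roughly like the sketch below. It is only an illustration under stated assumptions: the `price` and `qty` columns, the `--input`/`--output` flags, and the helper names are all hypothetical, and plain lists of dicts stand in for dataframes so that only the standard library is needed.

```python
import argparse
import csv
from pathlib import Path


def process(rows):
    """Pure processing: rows in, rows out. Knows nothing about paths or file formats."""
    # Hypothetical transformation: add a 'total' column from 'price' and 'qty'.
    return [{**row, "total": float(row["price"]) * int(row["qty"])} for row in rows]


def read_csv(path):
    """I/O lives outside process(); switching formats only touches these helpers."""
    with Path(path).open(newline="") as fh:
        return list(csv.DictReader(fh))


def write_csv(path, rows):
    fieldnames = list(rows[0]) if rows else []
    with Path(path).open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main(argv=None):
    """Single entry point: every path comes from the command line, none is hard-coded."""
    parser = argparse.ArgumentParser(description="Run the whole pipeline.")
    parser.add_argument("--input", required=True, help="path to the input CSV")
    parser.add_argument("--output", required=True, help="path for the output CSV")
    args = parser.parse_args(argv)
    write_csv(args.output, process(read_csv(args.input)))
```

    Calling `main()` from an `if __name__ == "__main__":` guard would turn the script into the single command the comment asks for, and a format change such as csv to parquet would leave `process` untouched.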

  • GitHub repo Activeloop Hub

    Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai (by activeloopai)

    Project mention: The hand-picked selection of the best Python libraries released in 2021 | reddit.com/r/Python | 2021-12-21

    Hub.

  • GitHub repo BentoML

    Model Serving Made Easy

    Project mention: How to Build a Machine Learning Demo in 2022 | dev.to | 2022-01-16

    Using a general-purpose framework such as FastAPI involves writing a lot of boilerplate code just to get your API endpoint up and running. If deploying a model for a demo is the only thing you are interested in and you do not mind losing some flexibility, you might want to use a specialized serving framework instead. One example is BentoML, which will allow you to get an optimized serving endpoint for your model up and running much faster and with less overhead than a generic web framework. Framework-specific serving solutions such as TensorFlow Serving and TorchServe typically offer optimized performance but can only be used to serve models trained using TensorFlow or PyTorch, respectively.

  • GitHub repo clearml

    ClearML - Auto-Magical CI/CD to streamline your ML workflow. Experiment Manager, MLOps and Data-Management

    Project mention: [D] Drop your best open source Deep learning related Project | reddit.com/r/MachineLearning | 2021-12-30

    Hi there. ClearML is our open-source solution which is part of the PyTorch ecosystem. We would really appreciate it if you read our README and starred us if you like what you see!

  • GitHub repo evidently

    Interactive reports to analyze machine learning models during validation or production monitoring.

    Project mention: The hand-picked selection of the best Python libraries released in 2021 | reddit.com/r/Python | 2021-12-21

    Evidently.

  • GitHub repo flyte

    Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

    Project mention: Hacktoberfest: Flytesnacks Project "update tuple output examples" | dev.to | 2021-11-01

    I chose the flytekit project, which is one of the component repos of flyte and is the python SDK and tools of the Flyte project

  • GitHub repo zenml

    ZenML 🙏: MLOps framework to create reproducible pipelines.

    Project mention: ZenML helps data scientists work across the full stack | news.ycombinator.com | 2022-01-05
  • GitHub repo budgetml

    Deploy a ML inference service on a budget in less than 10 lines of code.

    Project mention: Show HN: Deploy ML Models on a Budget | reddit.com/r/patient_hackernews | 2021-02-01
  • GitHub repo awesome-mlops

    :sunglasses: A curated list of awesome MLOps tools (by kelvins)

    Project mention: Run your first Kubeflow pipeline | dev.to | 2021-11-20

    Recently I've been learning MLOps. There's a lot to learn, as shown by this and this repository listing MLOps references and tools, respectively.

  • GitHub repo ploomber

    Write maintainable, production-ready pipelines using Jupyter or your favorite text editor. Develop locally, deploy to the cloud. ☁️

    Project mention: Simple workflow orchestration tool with Jupyter Notebook support | reddit.com/r/dataengineering | 2022-01-14

    Ploomber builds on top of papermill to provide a more streamlined experience; you can also open .py files as notebooks to get a nice git diff. https://github.com/ploomber/ploomber

  • GitHub repo deepchecks

    Test Suites for Validating ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.

    Project mention: Test suites for machine learning models in Python (New OSS package) | dev.to | 2022-01-18

    And if you liked the project, we'll be delighted to count you as one of our stargazers at https://github.com/deepchecks/deepchecks/stargazers!

  • GitHub repo rubrix

    ✨A Python framework to label, refine, and monitor data for NLP projects

    Project mention: Rubrix: Open-source tool for building NLP training sets (now with weak supervision) | reddit.com/r/LanguageTechnology | 2022-01-19
  • GitHub repo ck

    Collective Knowledge framework (CK) provides a common set of automation recipes, APIs and meta descriptions to enable collaborative, reproducible and unified benchmarking and optimization of ML Systems across continuously changing models, data sets, software and hardware. (by mlcommons)

    Project mention: Research software code is likely to remain a tangled mess | news.ycombinator.com | 2021-02-22

    Their solution product https://cknowledge.io/ and source code https://github.com/ctuning/ck

    I guess it should be helpful to the research community.

  • GitHub repo bodywork

    ML pipeline orchestration and model deployments on Kubernetes, made really easy.

    Project mention: Deployment automation for ML projects of all shapes and sizes | news.ycombinator.com | 2021-06-09
  • GitHub repo popmon

    Monitor the stability of a pandas or spark dataframe ⚙︎

    Project mention: Monitor the stability of a pandas or spark dataframe | news.ycombinator.com | 2021-09-15
  • GitHub repo chitra

    A multi-functional library for full-stack Deep Learning. Simplifies Model Building, API development, and Model Deployment.

    Project mention: Answer: Resizing image and its bounding box | dev.to | 2021-07-03

    Another way of doing this is to use CHITRA

  • GitHub repo fastapi-template

    Completely scalable FastAPI-based template for Machine Learning, Deep Learning and any other software project that wants to use FastAPI as an API framework.

    Project mention: Clean and Scalable Code Architecture for ML/DL and NLP driven micro-service | reddit.com/r/FastAPI | 2021-10-04

    Working on this open-source project to create a clean, scalable API for ML/DL-based projects using FastAPI. Have a look and please do give feedback: https://github.com/99sbr/fastapi-template

  • GitHub repo graphsignal

    Graphsignal Logger

    Project mention: [P] Model Performance Monitoring in Production | reddit.com/r/MachineLearning | 2021-11-01

    And the logger repo is https://github.com/graphsignal/graphsignal.

  • GitHub repo ml-template-azure

    Template for getting started with automated ML Ops on Azure Machine Learning

    Project mention: [D] How is MLOps done in your current workplace? | reddit.com/r/mlops | 2021-11-02
  • GitHub repo dbx

    CLI tool for advanced Databricks jobs management.

    Project mention: Anyone use Pyspark notebook in production ? | reddit.com/r/dataengineering | 2021-12-19
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-19.

Index

What are some of the best open-source MLOps projects in Python? This list will help you:

Rank  Project            Stars
1     label-studio       7,435
2     great_expectations 5,921
3     metaflow           5,178
4     Kedro              4,829
5     Activeloop Hub     4,200
6     BentoML            3,132
7     clearml            2,932
8     evidently          2,072
9     flyte              1,853
10    zenml              1,593
11    budgetml           1,252
12    awesome-mlops      999
13    ploomber           925
14    deepchecks         901
15    rubrix             723
16    ck                 452
17    bodywork           316
18    popmon             221
19    chitra             170
20    fastapi-template   111
21    graphsignal        102
22    ml-template-azure  80
23    dbx                75