Python Pipeline

Open-source Python projects categorized as Pipeline

Top 20 Python Pipeline Projects

  • GitHub repo great_expectations

    Always know what to expect from your data.

    Project mention: [P] Deepchecks: an open-source tool for high standards validations for ML models and data. | reddit.com/r/MachineLearning | 2022-01-06
  • GitHub repo Kedro

    A Python framework for creating reproducible, maintainable and modular data science code.

    Project mention: [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill? | reddit.com/r/MachineLearning | 2021-11-03

    I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise, you would be adding hassle with some (bluntly speaking) opinionated and inflexible boilerplate code which not many people will like using.

    You mention some issues: code that is hard to understand, and hard to execute and replicate. Bad code which does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. a common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first. However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code.

    Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. First, this means they have to specify exactly all libraries that need to be installed. Second, they need to externalize all configuration, in particular data input and data output paths. Not a single value should be hard-coded! And finally, they need a *single* command which can be run to execute the whole(!) pipeline. If they fail on any of these parts... they should try again. Work that does not pass this test is considered unfinished by the author. Basically you are introducing an automated, infallible test.

    Regarding your code, I'd really not go in that direction. Even these few lines already look unclear and over-engineered. The csv format is already hard-coded. If it changes to parquet you'd have to touch the code. The processing object has fixed data paths, for which there is no reason in a job which should take care of pure processing. Exporting data is also not something that a processing job should handle. And what if you have multiple input and output datasets? You would not have all these issues if you had kept to the simplest solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out. It would also mean zero additional libraries or boilerplate. I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may engage in.

    There are two small features which are useful beyond a plain function, though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already incorporates experience from a lot of developers. No need to rewrite that. I'm not sure which one is best (metaflow, luigi, airflow, ... https://github.com/pditommaso/awesome-pipeline no idea), but many come with a lot of features. If you want a bit more scaffolding to make foreign projects easier to understand, you could look at https://github.com/quantumblacklabs/kedro but maybe that's already too much. Fix the "single command replication-from-scratch requirement" first.
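
The advice in the comment above (a pure processing function, externalized paths, a cheap input check, and one command to run everything) might look roughly like the sketch below. It is only an illustration: the config keys, the file names, the example join, and the use of CSV rows as plain dicts instead of dataframes are all assumptions made here, not part of the original discussion.

```python
import argparse
import csv
import json
from pathlib import Path


def process(orders, customers):
    """Pure processing: data in, data out. No paths, no file formats."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [
        {"order_id": o["order_id"], "customer": names[o["customer_id"]]}
        for o in orders
        if o["customer_id"] in names
    ]


def check_inputs(config):
    """Cheap pre-flight check, run before any heavy work starts."""
    missing = [p for p in config["inputs"].values() if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")


def read_csv(path):
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def write_csv(path, rows, fieldnames):
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def run(config):
    """All I/O lives here, at the edge, so `process` never sees a path."""
    check_inputs(config)
    orders = read_csv(config["inputs"]["orders"])
    customers = read_csv(config["inputs"]["customers"])
    result = process(orders, customers)
    write_csv(config["output"], result, ["order_id", "customer"])
    return result


def main(argv=None):
    """Single entry point: every path comes from the (hypothetical) JSON config."""
    parser = argparse.ArgumentParser(description="Run the whole pipeline.")
    parser.add_argument("--config", required=True, help="path to a JSON config")
    args = parser.parse_args(argv)
    run(json.loads(Path(args.config).read_text()))

# In a real script, finish the file with:
#     if __name__ == "__main__":
#         main()
```

Saved as, say, `pipeline.py` (a hypothetical name), the whole run would then be one command of the form `python pipeline.py --config config.json`. Switching the storage format from CSV to something else would only touch `read_csv` and `write_csv`, not `process`.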

  • GitHub repo papermill

    📚 Parameterize, execute, and analyze notebooks

    Project mention: Release of IPython 8.0 | news.ycombinator.com | 2022-01-12

    - We mostly use notebooks as scratchpads or alpha prototypes.

    - Papermill is a great tool when setting up a scheduled notebook and then shipping the output to S3: https://papermill.readthedocs.io/en/latest/

    - When turning notebooks into more user-facing prototypes, I've found Streamlit is excellent for shipping something really fast. Some of these prototypes have stuck around as Streamlit apps when there are 1-3 users who need to use them regularly.

    - Moving to full-blown apps is much tougher and time-consuming.

  • GitHub repo PyFunctional

    Python library for creating data pipelines with chain functional programming

    Project mention: PyFunctional makes creating data pipelines easy by using chained functional operators | reddit.com/r/Python | 2021-03-31
  • GitHub repo mara-pipelines

    A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

    Project mention: How to keep track of the different Transformations done in an ETL pipeline? | reddit.com/r/dataengineering | 2021-08-22

    The closest I've found is Mara but not what I'm after.

  • GitHub repo galaxy

    Data intensive science for everyone. (by galaxyproject)

    Project mention: Developed a new kind of dual extruder system on fully custom built 3D printer | reddit.com/r/3Dprinting | 2021-03-01
  • GitHub repo LightAutoML

    LAMA - automatic model creation framework

    Project mention: Github Discussion: What is your favorite Data Science Repo? | reddit.com/r/datascience | 2021-07-24
  • GitHub repo pdpipe

    Easy pipelines for pandas DataFrames.

  • GitHub repo whispers

    Identify hardcoded secrets in static structured text (by Skyscanner)

    Project mention: Skyscanner/whispers - Identify hardcoded secrets and dangerous behaviours | reddit.com/r/GithubSecurityTools | 2021-10-07
  • GitHub repo bodywork

    ML pipeline orchestration and model deployments on Kubernetes, made really easy.

    Project mention: Deployment automation for ML projects of all shapes and sizes | news.ycombinator.com | 2021-06-09
  • GitHub repo pypyr automation task runner

    pypyr task-runner cli & api for automation pipelines. Automate anything by combining commands, different scripts in different languages & applications into one pipeline process.

    Project mention: Comparison of Python TOML parser libraries | dev.to | 2021-12-14

    The pypyr automation pipeline task-runner open-source project recently added TOML parsing & writing functionality as a core feature. To this end, I researched the available free & open-source Python TOML parser libraries to figure out which option to use.

  • GitHub repo karton

    Distributed malware processing framework based on Python, Redis and MinIO.

    Project mention: Using a Virtual Machine to Isolate and Test Files for Malware | reddit.com/r/vmware | 2022-01-13

    I did something along the lines of what you describe at work. The easiest way to check files is of course uploading their hashes to VirusTotal (it's free!), but if you still want to set up an automated malware analysis lab then VMware is a decent choice. You should have a reasonably beefy VM (at least 16 GB of RAM, a couple of CPU cores, rather large ROM; also make sure you expose hardware virtualization to this guest). You want the machine to have slightly better specs than a regular Windows PC. That way malware won't think "Oh hey, this computer I am on has suspiciously low specs - it's probably a VM! Better delete myself to hinder any threat hunting efforts". On that machine you should install a Linux distro, Ubuntu for example. Then on this Linux you should install a sandbox, for example Cuckoo (it works well on vSphere and ESXi guests). I know other sandbox software exists, but I worked with this one and it performed alright. Installing and configuring Cuckoo is a bit more involved than I'd like to get into in this comment, but I'm sure you will figure it out with the numerous tutorials and documentation pages available. Take a look at the Volatility framework too! For automation you might want to check out the Karton framework (https://github.com/CERT-Polska/karton). I haven't used it, but I had the chance to talk to its authors and it seems dope.

  • GitHub repo forte

    Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/

    Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02

    Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline with a few lines of python and extend them to different domains.

  • GitHub repo pierogis

    image and animation processing framework

    Project mention: pierogis/pierogis a framework for image and animation processing | reddit.com/r/Python | 2021-02-22
  • GitHub repo cool

    Make Python code cooler. Less is more. (by abersheeran)

    Project mention: Simple, efficient and pure Python implementation of Python pipeline operations | reddit.com/r/Python | 2021-05-17
  • GitHub repo healthsea

    Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

    Project mention: I built an NLP pipeline for analyzing supplement reviews called Healthsea 🐳 | reddit.com/r/Python | 2022-01-06

    Github: https://github.com/explosion/healthsea

  • GitHub repo alkymi

    Pythonic task automation

    Project mention: Alkymi – Data/Task Automation in Python | reddit.com/r/programming | 2021-03-23
  • GitHub repo spline

    Spline is a tool that is capable of running locally as well as part of well known pipelines like Jenkins (Jenkinsfile), Travis CI (.travis.yml) or similar ones. (by Nachtfeuer)

  • GitHub repo abcd-hcp-pipeline

    bids application for processing functional MRI data, robust to scanner, acquisition and age variability.

    Project mention: Siemens output from ABCD T1 and T2 sequences. | reddit.com/r/neuroscience | 2021-02-08

    Who provided the sequence? They're usually the point of contact for this kind of question. Alternatively, you can bug one of the processing groups for ABCD (link), and they might point you in the right direction. Getting one of the ABCD or ABIDE/HCP sequence designers to see this on reddit is unlikely, but good luck.

  • GitHub repo magda

    Library for building Modular and Asynchronous Graphs with Directed and Acyclic edges (MAGDA)

    Project mention: MAGDA – our open-source solution for spaghetti code | dev.to | 2021-04-14

    We would like to introduce you to our latest open-source library: MAGDA. The name is an abbreviation for “Modular Asynchronous Graphs with Directed and Acyclic edges”, which fully describes the idea behind it. The library enables building modular data pipelines with asynchronous processing in, e.g., machine learning and data science projects. It is intended for Python projects and is available on the NeuroSYS GitHub, as well as on PyPI. It aids our R&D teams not only by introducing some abstraction (classes and functions) but also by imposing an architectural pattern onto the project.
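
To make the idea in the announcement above concrete, here is a rough sketch of a modular, asynchronous, acyclic graph built only from the Python standard library (`asyncio` and `graphlib`, the latter available from Python 3.9). It does not use or imitate MAGDA's actual API; the function `run_graph` and the example module names are hypothetical and exist only for this illustration.

```python
import asyncio
from graphlib import TopologicalSorter  # standard library, Python 3.9+


async def run_graph(modules, dependencies):
    """Run async modules in dependency order.

    modules:      name -> async function taking a dict of upstream results
    dependencies: name -> set of upstream module names
    Modules whose dependencies are all satisfied run concurrently.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    while sorter.is_active():
        ready = sorter.get_ready()
        outputs = await asyncio.gather(
            *(
                modules[name](
                    {dep: results[dep] for dep in dependencies.get(name, ())}
                )
                for name in ready
            )
        )
        for name, output in zip(ready, outputs):
            results[name] = output
            sorter.done(name)  # unlocks modules that were waiting on this one
    return results


# Hypothetical example modules.
async def load(_inputs):
    await asyncio.sleep(0)  # stands in for real asynchronous I/O
    return [1, 2, 3]


async def double(inputs):
    return [x * 2 for x in inputs["load"]]


async def total(inputs):
    return sum(inputs["load"])


async def report(inputs):
    return {"doubled": inputs["double"], "total": inputs["total"]}


MODULES = {"load": load, "double": double, "total": total, "report": report}
DEPENDENCIES = {
    "load": set(),
    "double": {"load"},
    "total": {"load"},
    "report": {"double", "total"},
}
```

Calling `asyncio.run(run_graph(MODULES, DEPENDENCIES))` returns a dict with one result per module. In this sketch `double` and `total` are scheduled together because both depend only on `load`, while `report` waits for both of them.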

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-13.

Index

What are some of the best open-source Pipeline projects in Python? This list will help you:

 #  Project                       Stars
 1  great_expectations            5,921
 2  Kedro                         4,829
 3  papermill                     4,499
 4  PyFunctional                  1,961
 5  mara-pipelines                1,850
 6  galaxy                          909
 7  LightAutoML                     668
 8  pdpipe                          640
 9  whispers                        326
10  bodywork                        316
11  pypyr automation task runner    254
12  karton                          225
13  forte                           154
14  pierogis                        109
15  cool                             93
16  healthsea                        49
17  alkymi                           38
18  spline                           31
19  abcd-hcp-pipeline                17
20  magda                            10