Top 20 Python Pipeline Projects
Always know what to expect from your data.
Project mention: [P] Deepchecks: an open-source tool for high standards validations for ML models and data. | reddit.com/r/MachineLearning | 2022-01-06
A Python framework for creating reproducible, maintainable and modular data science code.
Project mention: [Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill? | reddit.com/r/MachineLearning | 2021-11-03
I'd focus more on understanding the issues in depth before jumping to a solution. Otherwise, you would be adding hassle with some, bluntly speaking, opinionated and inflexible boilerplate code which not many people will like using.

You mention some issues: code that is non-obvious to understand and hard to execute and replicate. Bad code that does not follow engineering best practices (ideas from SOLID etc.) does not get better if you force the author to introduce certain classes. You can suggest some basics (e.g. a common code formatter, meaningful variable names, short functions, no hard-coded values, ...), but I'm afraid you cannot educate non-engineers in a single-day workshop. I would not focus on that at first. However, there is no excuse for writing bad code and then expecting others to fix it. As you say, data engineering is part of data science skills; you are "junior" if you cannot write reproducible code.

Being hard to execute and replicate is theoretically easy to fix. Force everyone to (at least hypothetically) submit their code into a testing environment where it will be automatically executed on a fresh machine. This means that, first, they have to specify exactly which libraries need to be installed. Second, they need to externalize all configuration, in particular data input and data output paths. Not a single value should be hard-coded in the code! And finally they need a *single* command which can be run to execute the *whole* pipeline. If they fail on any of these parts, they should try again; work that does not pass this test is considered unfinished by its author. Basically you are introducing an automated, infallible test.

Regarding your code, I'd really not try that direction. In particular, even these few lines already look unclear and over-engineered. The csv format is hard-coded into the code; if it changes to parquet, you'd have to touch the code.
The processing object has fixed data paths, which have no place in a job that should handle pure processing. Exporting data is also not something a processing job should handle. And what if you have multiple input and output datasets?

You would not have any of these issues if you had stuck with the simplest solution: a function `process(data1, data2, ...) -> result_data` where dataframes are passed in and out. It would also mean zero additional libraries or boilerplate. I highly doubt that a function `main_pipe(...)` will fix the malpractices some people may commit.

There are two small features which are useful beyond a plain function, though: automatically generating a visual DAG from the code, and quickly checking whether input requirements are satisfied before heavy code is run. You can still put any mature DAG library on top, which probably already incorporates the experience of a lot of developers; no need to rewrite that. I'm not sure which one is best (Metaflow, Luigi, Airflow, ... see https://github.com/pditommaso/awesome-pipeline), but many come with a lot of features. If you want a bit more scaffolding to make foreign projects easier to understand, you could look at https://github.com/quantumblacklabs/kedro, but maybe that's already too much. Fix the "single command replication-from-scratch" requirement first.
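The `process(data1, data2, ...) -> result_data` shape described above can be sketched in a few lines: a pure function in the middle, with all paths and file formats pushed out to a thin entry point. This is a dependency-free illustration using lists of dicts where a real project would pass pandas DataFrames; the column names are hypothetical.

```python
import csv
import sys


def process(orders, customers):
    """Pure processing: data in, data out. No paths, no file formats."""
    names = {c["id"]: c["name"] for c in customers}
    return [
        {"order_id": o["id"], "customer": names.get(o["customer_id"], "unknown")}
        for o in orders
    ]


def read_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def main(orders_path, customers_path, out_path):
    # All I/O and configuration live at the edges, passed in from outside,
    # so the whole pipeline replicates with a single command.
    result = process(read_csv(orders_path), read_csv(customers_path))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer"])
        writer.writeheader()
        writer.writerows(result)


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

Swapping csv for parquet then only touches `read_csv` and `main`; `process` never changes.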
📚 Parameterize, execute, and analyze notebooks
Project mention: Release of IPython 8.0 | news.ycombinator.com | 2022-01-12
- We mostly use notebooks as scratchpads or alpha prototypes.
- Papermill is a great tool when setting up a scheduled notebook and then shipping the output to S3: https://papermill.readthedocs.io/en/latest/
- When turning notebooks into more user-facing prototypes, I've found Streamlit is excellent for shipping something really fast. Some of these prototypes have stuck around as Streamlit apps when there are 1-3 users who need them regularly.
- Moving to full-blown apps is much tougher and time-consuming.
Python library for creating data pipelines with chain functional programming
Project mention: PyFunctional makes creating data pipelines easy by using chained functional operators | reddit.com/r/Python | 2021-03-31
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Project mention: How to keep track of the different Transformations done in an ETL pipeline? | reddit.com/r/dataengineering | 2021-08-22
The closest I've found is Mara, but it's not quite what I'm after.
Data intensive science for everyone. (by galaxyproject)
Project mention: Developed a new kind of dual extruder system on fully custom built 3D printer | reddit.com/r/3Dprinting | 2021-03-01
LAMA - automatic model creation framework
Project mention: Github Discussion: What is your favorite Data Science Repo? | reddit.com/r/datascience | 2021-07-24
Easy pipelines for pandas DataFrames.
Identify hardcoded secrets in static structured text (by Skyscanner)
Project mention: Skyscanner/whispers - Identify hardcoded secrets and dangerous behaviours | reddit.com/r/GithubSecurityTools | 2021-10-07
ML pipeline orchestration and model deployments on Kubernetes, made really easy.
Project mention: Deployment automation for ML projects of all shapes and sizes | news.ycombinator.com | 2021-06-09
pypyr task-runner cli & api for automation pipelines. Automate anything by combining commands, different scripts in different languages & applications into one pipeline process.
Project mention: Comparison of Python TOML parser libraries | dev.to | 2021-12-14
The pypyr automation pipeline task-runner open-source project recently added TOML parsing & writing functionality as a core feature. To this end, I researched the available free & open-source Python TOML parser libraries to figure out which option to use.
Distributed malware processing framework based on Python, Redis and MinIO.
Project mention: Using a Virtual Machine to Isolate and Test Files for Malware | reddit.com/r/vmware | 2022-01-13
I did something along the lines of what you describe at work. The easiest way to check files is of course uploading their hashes to VirusTotal (it's free!), but if you still want to set up an automated malware analysis lab, then VMware is a decent choice.

You should have a reasonably beefy VM: at least 16 GB of RAM, a couple of CPU cores, and a rather large disk; also make sure you expose hardware virtualization to this guest. You want the machine to have slightly better specs than a regular Windows PC, so that malware won't think "Oh hey, this computer I am on has suspiciously low specs - it's probably a VM! Better delete myself to hinder any threat hunting efforts."

On that machine you should install a Linux distro, Ubuntu for example. Then you should install a sandbox, for example Cuckoo (it works well on vSphere/ESXi guests). I know other sandbox software exists, but I worked with this one and it performed alright. Installing and configuring Cuckoo is a bit more involved than I'd like to get into in this comment, but I'm sure you will figure it out with the numerous tutorials and documentation pages available. Take a look at the Volatility framework too!

For automation you might want to check out the Karton framework (https://github.com/CERT-Polska/karton). I haven't used it, but I had the chance to talk to its authors and it seems dope.
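The hash-lookup route mentioned above needs nothing beyond the standard library. A minimal sketch that computes the SHA-256 of a file (VirusTotal accepts MD5, SHA-1, or SHA-256 lookups); the function name is ours, not from any of the tools above:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large samples never load into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The resulting hex digest can be pasted into the VirusTotal search box
# (or sent to its API) to check whether the sample is already known.
```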
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Project mention: Building Modular and Re-purposable NLP Pipelines | reddit.com/r/learnmachinelearning | 2021-03-02
Introducing Forte, from the CASL open-source project at Petuum. Forte combines multiple NLP tools to construct an entire NLP pipeline in a few lines of Python, and extends it to different domains.
image and animation processing framework
Project mention: pierogis/pierogis a framework for image and animation processing | reddit.com/r/Python | 2021-02-22
Make Python code cooler. Less is more. (by abersheeran)
Project mention: Simple, efficient and pure Python implementation of Python pipeline operations | reddit.com/r/Python | 2021-05-17
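The pure-Python pipeline-operator idea can be reproduced in a few lines by overloading `__ror__`, so values flow left to right through `|`. This is a minimal sketch of the general technique, not the library's actual implementation:

```python
class Pipe:
    """Wrap a function so that `value | Pipe(f)` evaluates f(value)."""

    def __init__(self, fn, *args, **kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def __ror__(self, value):
        # Python calls this when the left operand's __or__ doesn't apply.
        return self.fn(value, *self.args, **self.kwargs)


evens = Pipe(lambda xs: [x for x in xs if x % 2 == 0])
double = Pipe(lambda xs: [x * 2 for x in xs])
total = Pipe(sum)

result = [1, 2, 3, 4] | evens | double | total
print(result)  # 12
```

Because each stage is just a wrapped function, no extra framework or boilerplate is needed to compose new pipelines.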
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.
Project mention: I built an NLP pipeline for analyzing supplement reviews called Healthsea 🐳 | reddit.com/r/Python | 2022-01-06
Pythonic task automation
Project mention: Alkymi – Data/Task Automation in Python | reddit.com/r/programming | 2021-03-23
Spline is a tool capable of running locally as well as part of well-known CI pipelines like Jenkins (Jenkinsfile), Travis CI (.travis.yml), or similar. (by Nachtfeuer)
BIDS application for processing functional MRI data, robust to scanner, acquisition and age variability.
Project mention: Siemens output from ABCD T1 and T2 sequences. | reddit.com/r/neuroscience | 2021-02-08
Who provided the sequence? They're usually the point of contact for this kind of question. Alternatively, you can bug one of the processing groups for ABCD (link), and they might point you in the right direction. The chance of one of the ABCD or ABIDE/HCP sequence designers seeing this on Reddit is low, but good luck.
Library for building Modular and Asynchronous Graphs with Directed and Acyclic edges (MAGDA)
Project mention: MAGDA – our open-source solution for spaghetti code | dev.to | 2021-04-14
We would like to introduce you to our latest open-source library: MAGDA. The name is an abbreviation for “Modular Asynchronous Graphs with Directed and Acyclic edges”, which fully describes the idea behind it. The library enables building modular data pipelines with asynchronous processing in, e.g., machine learning and data science projects. It is dedicated to Python projects and is available on the NeuroSYS GitHub as well as in the PyPI repository. It aids our R&D teams not only by introducing some abstraction (classes and functions) but also by imposing an architectural pattern onto the project.
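The core pattern MAGDA imposes, a directed acyclic graph of modules with asynchronous processing, can be sketched with nothing but `asyncio`: each module awaits its upstream results, so independent branches run concurrently. The module names below are hypothetical and the sketch is not MAGDA's actual API:

```python
import asyncio


async def run_dag():
    async def load():
        await asyncio.sleep(0.01)  # stand-in for real I/O
        return [3, 1, 2]

    async def clean(upstream):
        return sorted(await upstream)

    async def stats(upstream):
        data = await upstream
        return sum(data) / len(data)

    # Wire the graph: clean and stats both consume load's output.
    # Scheduling load as a Task means it runs once; awaiting a finished
    # Task from several consumers just returns its cached result.
    source = asyncio.ensure_future(load())
    cleaned, mean = await asyncio.gather(clean(source), stats(source))
    return cleaned, mean


cleaned, mean = asyncio.run(run_dag())
print(cleaned, mean)  # [1, 2, 3] 2.0
```

A real framework adds what this sketch omits: declarative wiring, validation that the graph is acyclic, and per-module configuration.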
Python Pipeline related posts
[Discussion] Applied machine learning implementation debate. Is OOP approach towards data preprocessing in python an overkill?
3 projects | reddit.com/r/MachineLearning | 3 Nov 2021
Noobie who is trying to use K8s needs confirmation to know if this is the way or he is overestimating Kubernetes.
3 projects | reddit.com/r/kubernetes | 20 Oct 2021
Skyscanner/whispers - Identify hardcoded secrets and dangerous behaviours
1 project | reddit.com/r/GithubSecurityTools | 7 Oct 2021
Creating new Data Pipelines from the command line
1 project | dev.to | 27 Sep 2021
Py Framework for creating reproducible, maintainable, modular datascience code
1 project | news.ycombinator.com | 13 Sep 2021
Github Discussion: What is your favorite Data Science Repo?
4 projects | reddit.com/r/datascience | 24 Jul 2021
I Started Streaming on Twitch
1 project | dev.to | 12 Jun 2021
What are some of the best open-source Pipeline projects in Python? This list will help you:
| # | Project | Stars |
|----|---------|-------|
| 11 | pypyr automation task runner | 254 |