dataops

Top 23 dataops Open-Source Projects

  1. cleanlab

    The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.

    Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18

    https://cleanlab.ai/

  3. flyte

    Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

    Project mention: Boost your ML pipeline performance with efficient parallelism | dev.to | 2025-04-09

    Flyte is a distributed computation framework that uses a Kubernetes Pod as the fundamental execution environment for each task in a pipeline. When you use MapTasks, Flyte automatically distributes the load among multiple Pods that run in parallel and limits each Pod to downloading and processing only a specific index from the inputs list, preventing inefficient duplicate data movement.
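The fan-out pattern described above can be illustrated with a minimal standard-library sketch. This is not Flyte's API: threads stand in for Pods, and `INPUT_URIS`, `load_one`, `process_index`, and `map_over_indices` are hypothetical names chosen for illustration. The point is only that each worker is handed a single index and loads just that element, rather than every worker pulling the whole inputs list.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs list; in a real pipeline these might be remote objects.
INPUT_URIS = ["part-0.csv", "part-1.csv", "part-2.csv"]


def load_one(index: int) -> str:
    """Simulate fetching only the element at `index` (stand-in for a download)."""
    return INPUT_URIS[index]


def process_index(index: int) -> str:
    """One unit of mapped work: load a single input and transform it."""
    item = load_one(index)  # the worker touches one input, not the full list
    return item.upper()


def map_over_indices(n_items: int, max_workers: int = 4) -> list[str]:
    """Fan out one call per index and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the submitted indices.
        return list(pool.map(process_index, range(n_items)))


if __name__ == "__main__":
    print(map_over_indices(len(INPUT_URIS)))
```

In an actual Flyte pipeline the scheduling, container isolation, and per-index data movement are handled by the platform; the sketch only mirrors the shape of the work distribution.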

  4. lance

    Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch with more integrations coming.

    Project mention: ⚡🦀 Deploy a blazing-fast & Lightweight LLM app with Rust-Rig-LanceDB | dev.to | 2024-11-22

    Lance is an open-source columnar data format designed for performant ML workloads.

  5. console

    Redpanda Console is a developer-friendly UI for managing your Kafka/Redpanda workloads. Console gives you a simple, interactive approach for gaining visibility into your topics, masking data, managing consumer groups, and exploring real-time data with time-travel debugging. (by redpanda-data)

    Project mention: AutoMQ Integration with Redpanda Console | dev.to | 2024-07-29

    cd /opt
    sudo curl -L -o redpanda_console.tar.gz https://github.com/redpanda-data/console/releases/download/v2.6.0/redpanda_console_2.6.0_linux_amd64.tar.gz

  6. whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

  7. sqlmesh

    Scalable and efficient data transformation framework - backwards compatible with dbt.

  8. elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

  10. fast-data-dev

    Kafka Docker for development. Kafka, Zookeeper, Schema Registry, Kafka-Connect, 20+ connectors

    Project mention: Ask HN: Who is hiring? (February 2025) | news.ycombinator.com | 2025-02-03

    Lenses.io | Senior Go (Golang) Developer | Full-Time | REMOTE (UK, Germany, Spain) | https://lenses.io

    Lenses.io is seeking two experienced Senior Go Developers to play a pivotal role in reshaping our cutting-edge streaming data platforms. Join our dynamic team dedicated to improving the developer experience (DevEx) in real-time data processing.

    If you are passionate about driving innovation and have a knack for Go development, this role might be the perfect fit for you. Showcase your expertise by contributing to the transformation of our product. If you are ready to embark on this exciting journey with us, share your CV via email at [email protected].

    Your Responsibilities:

    - Architect & Develop: Lead the redesign and enhancement of our streaming data platform utilizing Go.

    - Innovate: Elevate our Kafka-based solutions to ensure optimal performance and scalability.

    - Collaborate: Engage with diverse teams to shape and implement new features seamlessly.

    Experience:

    - Demonstrated proficiency in Go development.

    - Strong track record in constructing and managing scalable data systems.

    Technical Skills:

    - Deep knowledge of Kafka and associated streaming technologies is advantageous.

    - Familiarity with cloud platforms (AWS, GCP, Azure) and microservices architecture is valued.

  11. meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

  12. SREWorks

    Cloud Native DataOps & AIOps Platform | 云原生数智运维平台

  13. awesome-data-catalogs

    📙 Awesome Data Catalogs and Observability Platforms.

  14. optimus

    Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. (by raystack)

  15. tenzir

    Tenzir is the data pipeline engine for security teams.

  16. awesome-data-annotation

    A list of tools for annotating data, managing annotations, etc.

    Project mention: best-data-labeling-tools VS awesome-data-annotation - a user suggested alternative | libhunt.com/r/best-data-labeling-tools | 2024-11-13

    Great list of data annotation tools! One area that could be further explored is AI-assisted labeling, which can speed up the process and improve accuracy. For example, tools like Labellerr and Amazon SageMaker Ground Truth use AI to assist human annotators, making large-scale annotation more efficient. I also appreciate the inclusion of open-source tools like MakeSense and Label Studio, which are customizable for specific workflows. Keep up the great work—this is a valuable resource for anyone working in data annotation!

  17. versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  18. pbi-tools

    Power BI DevOps & Source Control Tool

  19. firehose

    Firehose is an extensible, no-code, and cloud-native service to load real-time streaming data from Kafka to data stores, data lakes, and analytical storage systems. (by raystack)

  20. squirrel-core

    A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

  21. dagger

    Dagger is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. (by raystack)

  22. raccoon

    Raccoon is a high-throughput, low-latency service to collect events in real-time from your web, mobile apps, and services using multiple network protocols. (by raystack)

  23. meteor

    Meteor is an easy-to-use, plugin-driven metadata collection framework to extract data from different sources and sink to any data catalog. (by raystack)

  24. space

    Unified storage framework for the entire machine learning lifecycle (by google)

  25. guardian

    Guardian is a universal data access management tool with automated access workflows and security controls across data stores, analytical systems, and cloud products. (by raystack)

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

dataops related posts

  • Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative

    4 projects | news.ycombinator.com | 14 Aug 2023
  • meltano VS cloudquery - a user suggested alternative

    2 projects | 2 Jun 2023
  • Show HN: Meltano Cloud (Gitlab spinout) – Managed infra for open source ELT

    1 project | news.ycombinator.com | 1 Jun 2023
  • DBT lays off 15% of their staff

    1 project | /r/dataengineering | 19 May 2023
  • SQL Mesh - Auto DAG generation!!

    1 project | /r/HoneyCombAI | 14 May 2023
  • Data transformation tools other than DBT

    1 project | /r/dataengineering | 14 May 2023

Index

What are some of the best open-source dataops projects? This list will help you:

# Project Stars
1 cleanlab 10,477
2 flyte 6,182
3 lance 4,472
4 console 3,988
5 whylogs 2,708
6 sqlmesh 2,256
7 elementary 2,048
8 fast-data-dev 2,039
9 meltano 2,034
10 SREWorks 1,867
11 awesome-data-catalogs 829
12 optimus 749
13 tenzir 671
14 awesome-data-annotation 596
15 versatile-data-kit 446
16 pbi-tools 349
17 firehose 336
18 squirrel-core 281
19 dagger 271
20 raccoon 209
21 meteor 205
22 space 155
23 guardian 136
