Top 23 dataops Open-Source Projects
-
cleanlab
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Project mention: Ask HN: Not a webdev, why are these sites so good? | news.ycombinator.com | 2024-06-18
https://cleanlab.ai/
-
flyte
Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
Project mention: Boost your ML pipeline performance with efficient parallelism | dev.to | 2025-04-09
Flyte is a distributed computation framework that uses a Kubernetes Pod as the fundamental execution environment for each task in a pipeline. When you use MapTasks, Flyte automatically distributes the load among multiple Pods that run in parallel and limits each Pod to downloading and processing only a specific index from the inputs list, preventing inefficient duplicate data movement.
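The fan-out pattern described in this excerpt can be sketched in plain Python. This is only an illustration of the idea, not Flyte's API: threads stand in for Pods, and `INPUTS`, `load_input`, `process_one`, and `run_mapped` are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a list of remote inputs (e.g. file URIs).
INPUTS = ["part-0", "part-1", "part-2", "part-3"]


def load_input(index: int) -> str:
    """Fetch only the single input at `index` (simulated; no real download)."""
    return INPUTS[index]


def process_one(index: int) -> str:
    """One worker handles exactly one index, so no input is loaded twice."""
    item = load_input(index)
    return item.upper()


def run_mapped(concurrency: int = 2) -> list[str]:
    """Fan out over the indices in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(process_one, range(len(INPUTS))))


if __name__ == "__main__":
    print(run_mapped())
```

Each worker receives an index rather than the whole list, which is the property the excerpt credits with avoiding duplicate data movement; in a real deployment the per-index load would be a download inside a separate Pod.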
-
lance
Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
Project mention: ⚡🦀 Deploy a blazing-fast & Lightweight LLM app with Rust-Rig-LanceDB | dev.to | 2024-11-22
Lance is an open-source columnar data format designed for performant ML workloads.
-
console
Redpanda Console is a developer-friendly UI for managing your Kafka/Redpanda workloads. Console gives you a simple, interactive approach for gaining visibility into your topics, masking data, managing consumer groups, and exploring real-time data with time-travel debugging. (by redpanda-data)
cd /opt
sudo curl -L -o redpanda_console.tar.gz https://github.com/redpanda-data/console/releases/download/v2.6.0/redpanda_console_2.6.0_linux_amd64.tar.gz
-
whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
-
elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
-
fast-data-dev
Kafka Docker for development. Kafka, Zookeeper, Schema Registry, Kafka-Connect, 20+ connectors
Lenses.io | Senior Go (Golang) Developer | Full-Time | REMOTE (UK, Germany, Spain) | https://lenses.io
Lenses.io is seeking two experienced Senior Go Developers to play a pivotal role in reshaping our cutting-edge streaming data platforms. Join our dynamic team dedicated to improving the developer experience (DevEx) in real-time data processing.
If you are passionate about driving innovation and have a knack for Go development, this role might be the perfect fit for you. Showcase your expertise by contributing to the transformation of our product. If you are ready to embark on this exciting journey with us, share your CV via email at [email protected].
Your Responsibilities:
- Architect & Develop: Lead the redesign and enhancement of our streaming data platform utilizing Go.
- Innovate: Elevate our Kafka-based solutions to ensure optimal performance and scalability.
- Collaborate: Engage with diverse teams to shape and implement new features seamlessly.
Experience:
- Demonstrated proficiency in Go development.
- Strong track record in constructing and managing scalable data systems.
Technical Skills:
- Deep knowledge of Kafka and associated streaming technologies is advantageous.
- Familiarity with cloud platforms (AWS, GCP, Azure) and microservices architecture is valued.
-
meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
-
optimus
Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management. (by raystack)
-
Project mention: best-data-labeling-tools VS awesome-data-annotation - a user suggested alternative | libhunt.com/r/best-data-labeling-tools | 2024-11-13
Great list of data annotation tools! One area that could be further explored is AI-assisted labeling, which can speed up the process and improve accuracy. For example, tools like Labellerr and Amazon SageMaker Ground Truth use AI to assist human annotators, making large-scale annotation more efficient. I also appreciate the inclusion of open-source tools like MakeSense and Label Studio, which are customizable for specific workflows. Keep up the great work—this is a valuable resource for anyone working in data annotation!
-
firehose
Firehose is an extensible, no-code, and cloud-native service to load real-time streaming data from Kafka to data stores, data lakes, and analytical storage systems. (by raystack)
-
squirrel-core
A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:
-
dagger
Dagger is an easy-to-use, configuration over code, cloud-native framework built on top of Apache Flink for stateful processing of real-time streaming data. (by raystack)
-
raccoon
Raccoon is a high-throughput, low-latency service to collect events in real-time from your web, mobile apps, and services using multiple network protocols. (by raystack)
-
meteor
Meteor is an easy-to-use, plugin-driven metadata collection framework to extract data from different sources and sink to any data catalog. (by raystack)
-
guardian
Guardian is a universal data access management tool with automated access workflows and security controls across data stores, analytical systems, and cloud products. (by raystack)
dataops related posts
-
Launch HN: Serra (YC S23) – Open-source, Python-based dbt alternative
-
meltano VS cloudquery - a user suggested alternative
2 projects | 2 Jun 2023
-
Show HN: Meltano Cloud (Gitlab spinout) – Managed infra for open source ELT
-
DBT lays off 15% of their staff
-
SQL Mesh - Auto DAG generation!!
-
Data transformation tools other than DBT
Index
What are some of the best open-source dataops projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | cleanlab | 10,477 |
2 | flyte | 6,182 |
3 | lance | 4,472 |
4 | console | 3,988 |
5 | whylogs | 2,708 |
6 | sqlmesh | 2,256 |
7 | elementary | 2,048 |
8 | fast-data-dev | 2,039 |
9 | meltano | 2,034 |
10 | SREWorks | 1,867 |
11 | awesome-data-catalogs | 829 |
12 | optimus | 749 |
13 | tenzir | 671 |
14 | awesome-data-annotation | 596 |
15 | versatile-data-kit | 446 |
16 | pbi-tools | 349 |
17 | firehose | 336 |
18 | squirrel-core | 281 |
19 | dagger | 271 |
20 | raccoon | 209 |
21 | meteor | 205 |
22 | space | 155 |
23 | guardian | 136 |