Big Data

Open-source projects categorized as Big Data

Top 23 Big Data Open-Source Projects

  • awesome-scalability

    The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11
  • ClickHouse

    ClickHouse® is a free analytics DBMS for big data

  • Project mention: We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions | news.ycombinator.com | 2024-04-02

    Yes, we are working on it! :) Taking some of the learnings from current experimental JSON Object datatype, we are now working on what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864

    Variant datatype is already available as experimental in 24.1, Dynamic datatype is WIP (PR almost ready), and JSON datatype is next up. Check out the latest comment on that issue with how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Apache Flink

  • Project mention: First 15 Open Source Advent projects | dev.to | 2023-12-15

    7. Apache Flink | Github | tutorial

  • gun

    An open source cybersecurity protocol for syncing decentralized graph data.

  • Project mention: gun: NEW Data - star count:17470.0 | /r/algoprojects | 2023-10-28
  • Presto

    The official home of the Presto distributed SQL query engine for big data

  • Project mention: Multi-Database Support in DuckDB | news.ycombinator.com | 2024-01-28

    We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes a fair bit of work to implement it for all the different backends.

  • QuestDB

    An open source time-series database for fast ingest and SQL queries

  • Project mention: How to Forecast Air Temperatures with AI + IoT Sensor Data | dev.to | 2024-03-24

    If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.

  • Cookbook

    The Data Engineering Cookbook

  • Project mention: Tranzitie catre data engineering | /r/programare | 2023-07-12

    Take a look at https://github.com/andkret/Cookbook. The author also has a YouTube channel: https://www.youtube.com/@andreaskayy

  • kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  • Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
  • NebulaGraph Database

    A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

  • Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19
  • Cython

    The most widely used Python to C compiler

  • Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10
  • kafka-ui

    Open-Source Web UI for Apache Kafka Management

  • Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17
  • catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

  • Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
  • starrocks

    StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

  • Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09

    tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb

    Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks

  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

  • Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24

    The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

    As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

    I think the website is here: https://delta.io

  • H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  • Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

    I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

  • risingwave

    Cloud-native SQL stream processing, analytics, and management. KsqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.

  • Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30

    How does this compare to RisingWave and Materialize?

    https://github.com/risingwavelabs/risingwave

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

    Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

  • arkime

    Arkime is an open source, large scale, full packet capturing, indexing, and database system.

  • pachyderm

    Data-Centric Pipelines and Data Versioning

  • Project mention: Open Source Advent Fun Wraps Up! | dev.to | 2024-01-05

    20. Pachyderm | Github | tutorial

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Big Data projects? This list will help you:

#   Project                         Stars
1   awesome-scalability             53,036
2   Apache Spark                    38,320
3   ClickHouse                      34,054
4   data-science-ipython-notebooks  26,459
5   Apache Flink                    23,128
6   gun                             17,784
7   Presto                          15,582
8   QuestDB                         13,448
9   Cookbook                        12,899
10  kafka-manager                   11,670
11  NebulaGraph Database            10,114
12  Trino                           9,552
13  Cython                          8,891
14  kafka-ui                        8,458
15  catboost                        7,731
16  starrocks                       7,726
17  beam                            7,508
18  delta                           6,874
19  H2O                             6,721
20  risingwave                      6,283
21  Zeppelin                        6,261
22  arkime                          6,101
23  pachyderm                       6,071
