SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Big Data Open-Source Projects
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
NebulaGraph Database
A distributed, fast open-source graph database featuring horizontal scalability and high availability (by vesoft-inc)
-
Trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
-
starrocks
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. InfoWorld’s 2023 BOSSIE Award for best open source software.
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
risingwave
Cloud-native SQL stream processing, analytics, and management. KsqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions | news.ycombinator.com | 2024-04-02Yes, we are working on it! :) Taking some of the learnings from current experimental JSON Object datatype, we are now working on what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864
Variant datatype is already available as experimental in 24.1, Dynamic datatype is WIP (PR almost ready), and JSON datatype is next up. Check out the latest comment on that issue with how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...
7. Apache Flink | Github | tutorial
We have some of this functionality in Presto (https://github.com/prestodb/presto), but it takes fair bit of work to implement it for all the different backends.
If your data lacks uniform time intervals between consecutive entries, QuestDB offers a solution by allowing you to sample your data. After that, MindsDB facilitates creating, training, and deploying your time-series models.
https://github.com/andkret/Cookbook arunca un ochi aici. Omul are si youtube channel https://www.youtube.com/@andreaskayy
Project mention: Trino: Fast distributed SQL query engine for big data analytics | news.ycombinator.com | 2024-03-19
Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10
Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
Project mention: A MySQL compatible database engine written in pure Go | news.ycombinator.com | 2024-04-09tidb has been around for a while, it is distributed, written in Go and Rust, and MySQL compatible. https://github.com/pingcap/tidb
Somewhat relatedly, StarRocks is also MySQL compatible, written in Java and C++, but it's tackling OLAP use-cases. https://github.com/StarRocks/starrocks
Project mention: Ask HN: Does (or why does) anyone use MapReduce anymore? | news.ycombinator.com | 2024-01-24The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.
Delta is pretty great, let's you do upserts into tables in DataBricks much easier than without it.
I think the website is here: https://delta.io
I would use H20 if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/
Project mention: Proton, a fast and lightweight alternative to Apache Flink | news.ycombinator.com | 2024-01-30How does this compare to RisingWave and Materialize?
https://github.com/risingwavelabs/risingwave
Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.
20. Pachyderm | Github | tutorial
Big Data related posts
- Top 10 Common Data Engineers and Scientists Pain Points in 2024
- Velox: Meta's Unified Execution Engine [pdf]
- Linkedin OpenHouse: Control Plane for Tables in Data Lakehouses
- Fair Benchmarking Considered Difficult (2018) [pdf]
- Choosing Between a Streaming Database and a Stream Processing Framework in Python
- ClickBench – A Benchmark for Analytical DBMS
- Why Postgres RDS didn't work for us
-
A note from our sponsor - SaaSHub
www.saashub.com | 24 Apr 2024
Index
What are some of the best open-source Big Data projects? This list will help you:
Project | Stars | |
---|---|---|
1 | awesome-scalability | 53,036 |
2 | Apache Spark | 38,320 |
3 | ClickHouse | 34,054 |
4 | data-science-ipython-notebooks | 26,459 |
5 | Apache Flink | 23,128 |
6 | gun | 17,784 |
7 | Presto | 15,582 |
8 | QuestDB | 13,448 |
9 | Cookbook | 12,899 |
10 | kafka-manager | 11,670 |
11 | NebulaGraph Database | 10,114 |
12 | Trino | 9,552 |
13 | Cython | 8,891 |
14 | kafka-ui | 8,458 |
15 | catboost | 7,731 |
16 | starrocks | 7,726 |
17 | beam | 7,508 |
18 | delta | 6,874 |
19 | H2O | 6,721 |
20 | risingwave | 6,283 |
21 | Zeppelin | 6,261 |
22 | arkime | 6,101 |
23 | pachyderm | 6,071 |
Sponsored