Top 23 Spark Open-Source Projects
-
Project mention: is anyone want to join maintaining spark java framework? | reddit.com/r/java | 2022-06-21
Wow, this has nothing to do with Apache Spark (https://spark.apache.org/), the wildly popular JVM-based data processing framework.
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Redash
Make Your Company Data Driven. Connect to any data source, then easily visualize, dashboard, and share your data.
Project mention: Show HN: DataStation – App to easily query, script, and visualize data | news.ycombinator.com | 2022-05-31
How does it compare to Redash (now Databricks SQL): https://github.com/getredash/redash?
-
cube.js
const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

// Build one `max` measure per column name.
const measures = measureNames.reduce(
  (result, sqlName) => ({
    ...result,
    [sqlName]: { sql: () => sqlName, type: `max` },
  }),
  {}
);

cube('errorpercentiles', {
  sql: `with sagemaker as (
      select
        model_name,
        variant_name,
        cast(json_extract(FROM_UTF8(from_base64(capturedata.endpointinput.data)),
          '$.correlation_id') as varchar) as correlation_id,
        cast(json_extract(FROM_UTF8(from_base64(capturedata.endpointoutput.data)),
          '$.prediction') as double) as prediction
      from s3.sagemaker_logs.logs
    ),
    actual as (
      select correlation_id, actual_value
      from postgresql.public.actual_values
    ),
    logs as (
      select
        model_name,
        variant_name as model_variant,
        sagemaker.correlation_id,
        prediction,
        actual_value as actual
      from sagemaker
      left outer join actual
        on sagemaker.correlation_id = actual.correlation_id
    ),
    errors as (
      select abs(prediction - actual) as abs_err, model_name, model_variant
      from logs
    ),
    percentiles as (
      select
        approx_percentile(abs_err, 0.1) as perc_10,
        approx_percentile(abs_err, 0.2) as perc_20,
        approx_percentile(abs_err, 0.3) as perc_30,
        approx_percentile(abs_err, 0.4) as perc_40,
        approx_percentile(abs_err, 0.5) as perc_50,
        approx_percentile(abs_err, 0.6) as perc_60,
        approx_percentile(abs_err, 0.7) as perc_70,
        approx_percentile(abs_err, 0.8) as perc_80,
        approx_percentile(abs_err, 0.9) as perc_90,
        approx_percentile(abs_err, 1.0) as perc_100,
        model_name,
        model_variant
      from errors
      group by model_name, model_variant
    )
    select
      count(*) filter (where e.abs_err <= perc_10) as perc_10,
      max(perc_10) as perc_10_value,
      count(*) filter (where e.abs_err > perc_10 and e.abs_err <= perc_20) as perc_20,
      max(perc_20) as perc_20_value,
      count(*) filter (where e.abs_err > perc_20 and e.abs_err <= perc_30) as perc_30,
      max(perc_30) as perc_30_value,
      count(*) filter (where e.abs_err > perc_30 and e.abs_err <= perc_40) as perc_40,
      max(perc_40) as perc_40_value,
      count(*) filter (where e.abs_err > perc_40 and e.abs_err <= perc_50) as perc_50,
      max(perc_50) as perc_50_value,
      count(*) filter (where e.abs_err > perc_50 and e.abs_err <= perc_60) as perc_60,
      max(perc_60) as perc_60_value,
      count(*) filter (where e.abs_err > perc_60 and e.abs_err <= perc_70) as perc_70,
      max(perc_70) as perc_70_value,
      count(*) filter (where e.abs_err > perc_70 and e.abs_err <= perc_80) as perc_80,
      max(perc_80) as perc_80_value,
      count(*) filter (where e.abs_err > perc_80 and e.abs_err <= perc_90) as perc_90,
      max(perc_90) as perc_90_value,
      count(*) filter (where e.abs_err > perc_90 and e.abs_err <= perc_100) as perc_100,
      max(perc_100) as perc_100_value,
      p.model_name,
      p.model_variant
    from percentiles p, errors e
    group by p.model_name, p.model_variant`,

  preAggregations: {
    // Pre-aggregation definitions go here.
    // Learn more: https://cube.dev/docs/caching/pre-aggregations/getting-started
  },

  joins: {},

  measures: measures,

  dimensions: {
    modelVariant: { sql: `model_variant`, type: 'string' },
    modelName: { sql: `model_name`, type: 'string' },
  },
});
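For intuition, the bucketing that this SQL performs can be sketched in plain Python: compute decile cut points over the absolute errors, then count how many errors fall between consecutive cuts. This is only an illustration with made-up data; `statistics.quantiles` is a rough stand-in for Trino's `approx_percentile`.

```python
# Sketch of the cube's decile bucketing over absolute prediction errors.
from statistics import quantiles

abs_errors = [0.5, 1.2, 0.3, 2.8, 0.9, 1.7, 0.1, 3.4, 2.1, 0.6]

# Nine decile cut points, analogous to approx_percentile(abs_err, 0.1 .. 0.9);
# the 100th percentile is simply the maximum error.
cuts = quantiles(abs_errors, n=10) + [max(abs_errors)]

buckets = {}
lower = float('-inf')
for i, upper in enumerate(cuts, start=1):
    name = f'perc_{i * 10}'
    # count(*) FILTER (WHERE abs_err > lower AND abs_err <= upper)
    buckets[name] = sum(1 for e in abs_errors if lower < e <= upper)
    buckets[f'{name}_value'] = upper
    lower = upper
```

Every error lands in exactly one bucket, so the per-bucket counts sum back to the number of samples, which is what the `FILTER` clauses in the outer `SELECT` rely on.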
-
horovod
Project mention: Anyone know of any papers or models for segmenting satellite images of a city into things like roads, buildings, parks, etc? | reddit.com/r/MLQuestions | 2022-04-25
Training is not the same as inference (doing the segmentation), so that scale is probably off by a lot, perhaps one or two orders of magnitude depending on the hardware you're running on, and your training and eval datasets would be several orders of magnitude smaller. FAANGs would parallelize that training as well (I don't remember whether UNet is inherently parallelizable for training) via their internal equivalent of Horovod, so they can do a GPU-month's worth of training in less than a day.
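The data-parallel idea behind Horovod can be illustrated with a toy sketch (this is plain Python, not Horovod's API): each worker computes gradients on its own data shard, and an allreduce averages them so every worker applies the same synchronized update.

```python
# Toy illustration of allreduce-style gradient averaging.
def allreduce_mean(per_worker_grads):
    """Average gradients element-wise across workers."""
    n_workers = len(per_worker_grads)
    return [sum(g) / n_workers for g in zip(*per_worker_grads)]

# Four hypothetical workers, each holding gradients for the same three parameters.
worker_grads = [
    [0.2, -0.4, 0.1],
    [0.4, -0.2, 0.3],
    [0.0, -0.6, 0.1],
    [0.2, -0.4, 0.1],
]
avg = allreduce_mean(worker_grads)  # one shared update for all workers
```

In real Horovod the averaging runs as a ring-allreduce over GPUs, but the contract is the same: after the step, every worker holds identical averaged gradients.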
-
Deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular, tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff, a PyTorch/TensorFlow-like library for running deep learning with automatic differentiation.
DL4J
-
ds-cheatsheets
Project mention: ⚙️ Data Science Cheat Sheets: A collection of cheat sheets for #DataScience and problem solving. h/t @Sauain | reddit.com/r/policerewired | 2021-10-01
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Something like this at least is the most direct answer to your question, as opposed to "you're doing it wrong" which unfortunately seems to be more upvoted. An example of something like this might be https://github.com/donnemartin/dev-setup
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
Have you tried Apache Zeppelin? I remember that you can pretty-print Spark dataframes directly in it with z.show(df).
-
Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. (by delta-io)
Project mention: Databricks platform for small data, is it worth it? | reddit.com/r/dataengineering | 2022-06-29
Currently the infrastructure we have is some custom-made pipelines that load the data onto S3, and I use Delta Tables here and there for their convenience: ACID, time travel, merges, CDC, etc.
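The time travel mentioned here rests on Delta's append-only transaction log: every write commits a new table version, and reading "as of" an earlier version replays the log only up to that point. A toy sketch of the idea (plain Python, not the Delta API):

```python
# Minimal illustration of log-based time travel.
class VersionedTable:
    def __init__(self):
        self._log = []  # append-only list of committed row batches

    def commit(self, rows):
        """Append a new version and return its version number."""
        self._log.append(list(rows))
        return len(self._log) - 1

    def read(self, as_of=None):
        """Return all rows visible at version `as_of` (latest when None)."""
        end = len(self._log) if as_of is None else as_of + 1
        return [row for commit in self._log[:end] for row in commit]

t = VersionedTable()
v0 = t.commit([{'id': 1}])
v1 = t.commit([{'id': 2}])
```

In Delta itself the equivalent read is expressed with a reader option such as `versionAsOf`; the sketch only shows why old versions stay queryable: commits are never rewritten in place.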
-
TensorFlowOnSpark
Currently I use the TensorFlowOnSpark framework to train and predict with my model. At prediction time, I have billions of samples to score, which is time-consuming. I wonder if there are good practices for this.
-
SynapseML
Project mention: [P] Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on large-scale data | reddit.com/r/MachineLearning | 2022-03-08
Link to Release Notes: https://github.com/microsoft/SynapseML/releases/tag/v0.9.5
-
spark-nlp
Project mention: Spark-NLP 4.0.0 🚀: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ SOTA models | reddit.com/r/apachespark | 2022-06-15
-
deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | reddit.com/r/dataengineering | 2022-06-28
GE (Great Expectations) is arguably the most well-known OSS alternative to Soda Core. The third option is deequ, originally developed and released as OSS by AWS. Our community has told us that Soda Core is different because it's easy to get going and embed into data pipelines, and it also allows some of the check-authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option, for example when you only use Pandas dataframes or develop in Scala.
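The "unit tests for data" idea that deequ (and tools like Soda Core) implement can be sketched in a few lines of plain Python; this is an illustration of the pattern, not either library's API, and the dataset and check names are made up: declare named checks against a dataset and collect pass/fail results instead of asserting mid-pipeline.

```python
# Declarative data-quality checks over a small in-memory dataset.
def run_checks(rows, checks):
    """Run each named predicate over the whole dataset, returning pass/fail."""
    return {name: check(rows) for name, check in checks.items()}

orders = [
    {'order_id': 1, 'amount': 25.0},
    {'order_id': 2, 'amount': 99.5},
    {'order_id': 3, 'amount': 10.0},
]

results = run_checks(orders, {
    'is_complete(order_id)': lambda rs: all(r['order_id'] is not None for r in rs),
    'is_unique(order_id)': lambda rs: len({r['order_id'] for r in rs}) == len(rs),
    'is_non_negative(amount)': lambda rs: all(r['amount'] >= 0 for r in rs),
})
```

In deequ the same checks would run as Spark aggregations over a large DataFrame (completeness, uniqueness, range constraints), so the dataset never has to fit in memory; the declarative shape is the point.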
-
Quill
I think Quill is the closest to your request: https://github.com/zio/zio-quill
Spark-related posts
- Databricks platform for small data, is it worth it?
- Delta Lake 2.0.0 Preview
- Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines?
- Project Nessie: Transactional Catalog for Data Lakes with Git-Like Semantics
- Does anyone actually use ML.NET?
- Introduction to The World of Data - (OLTP, OLAP, Data Warehouses, Data Lakes and more)
- Data point versioning infrastructure for time traveling to a precise point in time?
Index
What are some of the best open-source Spark projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Spark | 33,221 |
2 | data-science-ipython-notebooks | 23,245 |
3 | Redash | 21,321 |
4 | cube.js | 13,251 |
5 | horovod | 12,530 |
6 | Deeplearning4j | 12,509 |
7 | ds-cheatsheets | 10,649 |
8 | H2O | 5,869 |
9 | dev-setup | 5,754 |
10 | Zeppelin | 5,717 |
11 | Alluxio (formerly Tachyon) | 5,700 |
12 | delta | 4,473 |
13 | BigDL | 3,959 |
14 | TensorFlowOnSpark | 3,799 |
15 | SynapseML | 3,340 |
16 | HELK | 3,231 |
17 | koalas | 3,143 |
18 | Spark Notebook | 3,124 |
19 | spark-nlp | 2,785 |
20 | dpark | 2,679 |
21 | RoaringBitmap | 2,653 |
22 | deequ | 2,302 |
23 | Quill | 2,059 |