Top 23 Spark Open-Source Projects
-
Apache Spark
Project mention: Apache Spark VS cocoindex - a user suggested alternative | libhunt.com/r/spark | 2025-04-01
-
data-engineering-zoomcamp
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Project mention: Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra | dev.to | 2025-02-04
DE Zoomcamp Resources: Data Engineering Zoomcamp
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Project mention: The 50 best open-source alternatives to popular SaaS software | dev.to | 2024-07-10
GitHub: Redash GitHub Repository
-
ChuanhuChatGPT
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
-
Deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
-
doris
Project mention: Apache Doris: open-source data warehouse for real time data analytics | news.ycombinator.com | 2024-10-26
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
Here, we use the free Mage AI orchestration tool.
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
Project mention: Twitter's 600-Tweet Daily Limit Crisis: Soaring GCP Costs and the Open Source Fix Elon Musk Ignored | dev.to | 2025-04-10
Delta Lake: Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata management, and data versioning on top of existing data lakes. It aims to bring reliability and performance optimizations to big data workloads while ensuring data integrity and consistency.
-
risingwave
Project mention: Simplifying SQL function implementation with Rust procedural macro | dev.to | 2025-03-13
Then, use declarative macros to generate the various kinds of kernel functions: functions with 1, 2, and 3 parameters, as well as the input/output combinations of T and Option. Common kernels such as unary, binary, ternary, unary_nullable, and unary_bytes are generated this way, partially addressing the last two issues. (For the implementation details, see RisingWave's earlier code.) In theory, a type exercise could also be used here, for example by introducing a trait to unify (A,), (A, B), and (A, B, C), or by using the Into and AsRef traits to unify T, Option, and Result. However, you will probably run into some type challenges posed by rustc.
-
sqlglot
sqlglot, an advanced SQL parsing library, could be a nice option to add here.
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
To do so, we will use Kinesis Data Analytics to run an Apache Flink application. To enhance our development experience, we will use Studio notebooks for Kinesis Data Analytics that are powered by Apache Zeppelin.
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
-
RoaringBitmap
A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others
There's actually a whole website about it! I found it useful when I was doing deeper research into Elasticsearch: https://roaringbitmap.org
-
deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Deequ GitHub Repository
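To make the "unit tests for data" idea concrete, here is a minimal sketch in plain Python. It does not use Deequ or Spark and is not Deequ's API; the helper names (`is_complete`, `is_unique`, `is_non_negative`, `run_checks`) and the sample rows are hypothetical, chosen only to illustrate declaring constraints on a dataset and evaluating them together.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Check = Callable[[List[Row]], bool]


def is_complete(column: str) -> Check:
    """Constraint: the column has no missing (None) values."""
    def check(rows: List[Row]) -> bool:
        return all(row.get(column) is not None for row in rows)
    return check


def is_unique(column: str) -> Check:
    """Constraint: no value in the column appears more than once."""
    def check(rows: List[Row]) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check


def is_non_negative(column: str) -> Check:
    """Constraint: every present value in the column is >= 0."""
    def check(rows: List[Row]) -> bool:
        return all(row[column] >= 0 for row in rows if row.get(column) is not None)
    return check


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, bool]:
    """Evaluate every named constraint and report pass/fail per constraint."""
    return {name: check(rows) for name, check in checks.items()}


# Hypothetical sample data with a duplicate id, a missing name, and a negative price.
rows = [
    {"id": 1, "name": "a", "price": 9.5},
    {"id": 2, "name": None, "price": 3.0},
    {"id": 2, "name": "c", "price": -1.0},
]

results = run_checks(rows, {
    "id is complete": is_complete("id"),
    "id is unique": is_unique("id"),
    "name is complete": is_complete("name"),
    "price is non-negative": is_non_negative("price"),
})

for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

A library like Deequ applies the same pattern at scale by computing the underlying metrics on Spark rather than over an in-memory list.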
Spark related posts
-
Hybrid in-memory and disk cache in Rust
-
Study Note DE Zoomcamp 1.2.4 - Dockerizing the Ingestion Script
-
Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead
-
Data Engineering Zoomcamp 2025 Cohort: Introduction - Self-Study Notes
-
Apache Zeppelin
-
Migrating C# to Python with Claude 3.5 Sonnet.
-
Deequ: Your Data's BFF
Index
What are some of the best open-source Spark projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Apache Spark | 40,958 |
| 2 | data-engineering-zoomcamp | 30,063 |
| 3 | data-science-ipython-notebooks | 27,993 |
| 4 | Redash | 27,219 |
| 5 | ChuanhuChatGPT | 15,421 |
| 6 | ds-cheatsheets | 15,110 |
| 7 | horovod | 14,451 |
| 8 | Deeplearning4j | 13,926 |
| 9 | doris | 13,529 |
| 10 | Mage | 8,245 |
| 11 | delta | 7,957 |
| 12 | risingwave | 7,657 |
| 13 | sqlglot | 7,531 |
| 14 | H2O | 7,117 |
| 15 | Alluxio (formerly Tachyon) | 6,977 |
| 16 | Zeppelin | 6,476 |
| 17 | dev-setup | 6,167 |
| 18 | SynapseML | 5,119 |
| 19 | spark-nlp | 3,951 |
| 20 | TensorFlowOnSpark | 3,874 |
| 21 | HELK | 3,822 |
| 22 | RoaringBitmap | 3,647 |
| 23 | deequ | 3,405 |