Top 23 Scala Spark Projects
-
Project mention: Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom | dev.to | 2025-03-11
One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers. Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27
When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.
-
deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Deequ GitHub Repository
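The "unit tests for data" idea can be illustrated with a minimal sketch in plain Python. This is not Deequ's API: the names `Check`, `is_complete`, `is_unique`, and `run_checks` are hypothetical, and the rows are made-up sample data. It only shows the general shape of declaring constraints on a dataset and evaluating them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A row is a plain dict; None stands for a missing value.
Row = Dict[str, Optional[object]]


@dataclass
class Check:
    """A named constraint evaluated over the whole dataset (hypothetical type)."""
    name: str
    predicate: Callable[[List[Row]], bool]


def is_complete(column: str) -> Check:
    """Passes when no row has a missing (None) value in `column`."""
    return Check(
        f"is_complete({column})",
        lambda rows: all(r.get(column) is not None for r in rows),
    )


def is_unique(column: str) -> Check:
    """Passes when every value in `column` occurs exactly once."""
    def predicate(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return Check(f"is_unique({column})", predicate)


def run_checks(rows: List[Row], checks: List[Check]) -> Dict[str, bool]:
    """Evaluate each check and report pass/fail by check name."""
    return {c.name: c.predicate(rows) for c in checks}


# Made-up sample data: one missing name and one duplicated id.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 2, "name": "c"},
]

results = run_checks(rows, [is_complete("id"), is_complete("name"), is_unique("id")])
for name, passed in results.items():
    print(name, "passed" if passed else "failed")
```

Deequ itself evaluates such constraints on Spark DataFrames, so the checks scale to large datasets; the sketch above keeps everything in memory purely for illustration.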
-
kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
-
If you're looking for more modern notebooks supporting Scala (and Spark):
- https://almond.sh
- https://polynote.org
Toree is mostly dead but might also get a Scala 2.13 release now that Spark 4.0 is approaching.
-
LearningSparkV2
This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
-
adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
-
incubator-livy
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
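As a rough sketch of what "interacting with Spark over REST" looks like, the snippet below builds (but does not send) a batch-submission request for Livy's `POST /batches` endpoint using only the Python standard library. The server URL, jar path, class name, and argument are hypothetical placeholders, and `build_batch_request` is an illustrative helper, not part of Livy.

```python
import json
import urllib.request
from typing import List, Optional


def build_batch_request(
    livy_url: str,
    jar_path: str,
    class_name: str,
    args: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build a POST /batches request asking Livy to run a Spark application."""
    payload = {
        "file": jar_path,          # application jar to execute
        "className": class_name,   # main class inside the jar
        "args": list(args or []),  # command-line arguments for the application
    }
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: a local Livy server and an example jar on HDFS.
request = build_batch_request(
    "http://localhost:8998",
    "hdfs:///jobs/example.jar",
    "com.example.Main",
    ["2025-01-01"],
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request, for example with `urllib.request.urlopen(request)`, requires a running Livy server and is left out so the sketch stays self-contained.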
-
Project mention: Ask HN: Anyone using Delta Sharing in production? | news.ycombinator.com | 2024-07-01
-
sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
Scala Spark related posts
-
Deequ: Your Data's BFF
-
Snowflake removes Spark Pushdown support in favour of Snowpark
-
Make Rust Object Oriented with the dual-trait pattern
-
Ask HN: Anyone using Delta Sharing in production?
-
Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!
-
Azure data lake - Data Share
-
Pandas was faster and less memory intensive then crealytics pyspark. How is it possible?
Index
What are some of the best open-source Spark projects in Scala? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | Apache Spark | 40,785 |
| 2 | delta | 7,892 |
| 3 | SynapseML | 5,106 |
| 4 | spark-nlp | 3,942 |
| 5 | deequ | 3,385 |
| 6 | kyuubi | 2,162 |
| 7 | Quill | 2,152 |
| 8 | spark-cassandra-connector | 1,944 |
| 9 | Jupyter Scala | 1,610 |
| 10 | mleap | 1,515 |
| 11 | LearningSparkV2 | 1,261 |
| 12 | adam | 1,016 |
| 13 | H2O | 966 |
| 14 | incubator-livy | 905 |
| 15 | tispark | 886 |
| 16 | frameless | 884 |
| 17 | spark-rapids | 877 |
| 18 | delta-sharing | 817 |
| 19 | spark-daria | 759 |
| 20 | sparkMeasure | 736 |
| 21 | spline | 614 |
| 22 | metorikku | 585 |
| 23 | spark-excel | 483 |