Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  1. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Unveiling the Apache License 2.0: A Deep Dive into Open Source Freedom | dev.to | 2025-03-11

    One of the key attributes of Apache License 2.0 is its flexible nature. Permitting use in both proprietary and open source environments, it has become the go-to choice for innovative projects ranging from the Apache HTTP Server to large-scale initiatives like Apache Spark and Hadoop. This flexibility is not solely legal; it is also philosophical. The license is designed to encourage transparency and maintain a healthy balance between freedom and accountability, ultimately making it easier for developers to adapt and contribute without restrictive legal barriers. Another modern twist discussed in the article is the concept of dual licensing. Dual licensing can offer an attractive method for additional commercial exploitation while still upholding open source principles. However, as the article cautions, dual licensing involves legal intricacy and demands rigor in managing Contributor License Agreements (CLAs), a challenge that the open source community navigates with ongoing debates. For developers looking to understand similar innovative approaches to licensing, further information can be explored at License Token.

  3. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: Stream Processing Systems in 2025: RisingWave, Flink, Spark Streaming, and What's Ahead | dev.to | 2025-01-27

    When it comes to stream processing systems, Iceberg support varies across vendors. Databricks, which oversees Spark Streaming, focuses on Delta Lake. Apache Flink, heavily influenced by Alibaba’s contributions, promotes Paimon, an alternative to Iceberg. RisingWave, on the other hand, fully embraces Iceberg. Rather than focusing solely on one table format, RisingWave aims to support various catalog services, including AWS Glue Catalog, Polaris, and Unity Catalog.

  4. SynapseML

    Simple and Distributed Machine Learning

  5. spark-nlp

    State of the Art Natural Language Processing

  6. deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: Deequ: Your Data's BFF | dev.to | 2024-08-23

    Deequ GitHub Repository

  7. kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  8. Quill

    Compile-time Language Integrated Queries for Scala

  10. spark-cassandra-connector

    DataStax Connector for Apache Spark to Apache Cassandra (by datastax)

  11. Jupyter Scala

    A Scala kernel for Jupyter

    Project mention: Apache Zeppelin | news.ycombinator.com | 2024-09-02

    If you're looking for more modern notebooks supporting Scala (and Spark):

    - https://almond.sh

    - https://polynote.org

    Toree is mostly dead but might also get a Scala 2.13 release now that Spark 4.0 is approaching.

  12. mleap

    MLeap: Deploy ML Pipelines to Production

  13. LearningSparkV2

    This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

  14. adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  15. H2O

    Sparkling Water provides H2O functionality inside Spark cluster

  16. incubator-livy

    Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

  17. tispark

    TiSpark is built for running Apache Spark on top of TiDB/TiKV

  18. frameless

    Expressive types for Spark.

  19. spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  20. delta-sharing

    An open protocol for secure data sharing

    Project mention: Ask HN: Anyone using Delta Sharing in production? | news.ycombinator.com | 2024-07-01

  21. spark-daria

    Essential Spark extensions and helper methods ✨😲

  22. sparkMeasure

    This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

  23. spline

    Data Lineage Tracking And Visualization Solution (by AbsaOSS)

  24. metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  25. spark-excel

    A Spark plugin for reading and writing Excel files

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Spark related posts

  • Deequ: Your Data's BFF

    3 projects | dev.to | 23 Aug 2024
  • Snowflake removes Spark Pushdown support in favour of Snowpark

    1 project | news.ycombinator.com | 5 Aug 2024
  • Make Rust Object Oriented with the dual-trait pattern

    2 projects | dev.to | 8 Jul 2024
  • Ask HN: Anyone using Delta Sharing in production?

    1 project | news.ycombinator.com | 1 Jul 2024
  • Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!

    1 project | /r/Python | 6 Sep 2023
  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023
  • Pandas was faster and less memory intensive then crealytics pyspark. How is it possible?

    2 projects | /r/dataengineering | 17 Jun 2023

Index

What are some of the best open-source Spark projects in Scala? This list will help you:

#   Project                    Stars
1   Apache Spark               40,785
2   delta                       7,892
3   SynapseML                   5,106
4   spark-nlp                   3,942
5   deequ                       3,385
6   kyuubi                      2,162
7   Quill                       2,152
8   spark-cassandra-connector   1,944
9   Jupyter Scala               1,610
10  mleap                       1,515
11  LearningSparkV2             1,261
12  adam                        1,016
13  H2O                           966
14  incubator-livy                905
15  tispark                       886
16  frameless                     884
17  spark-rapids                  877
18  delta-sharing                 817
19  spark-daria                   759
20  sparkMeasure                  736
21  spline                        614
22  metorikku                     585
23  spark-excel                   483

Did you know that Scala is the 38th most popular programming language based on number of references?