Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  • GitHub repo Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: What is B2D Sector? | | 2021-10-17

    Example tools: TensorFlow, Tableau, Apache Spark, MATLAB, Jupyter

  • GitHub repo delta

    An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. (by delta-io)

    Project mention: SCD type 2 in spark | | 2021-10-15

    Use Hudi or Delta Lake.

  • GitHub repo SynapseML

    Microsoft Machine Learning for Apache Spark

    Project mention: Machine learning on JVM | | 2021-04-05

    Microsoft ML for Spark gets you a range of powerful ML features on Spark.

  • GitHub repo spark-nlp

    State of the Art Natural Language Processing

    Project mention: November 2021 workshops -- please comment about your preferences | | 2021-10-15

  • GitHub repo Quill

    Compile-time Language Integrated Queries for Scala (by getquill)

    Project mention: Scala, 2.12/2.13, which driver/library recommend for connecting to Cassandra | | 2021-06-19

    … is my choice. Works like a charm.

  • GitHub repo deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: PySpark - How to get Corrupted Records after Casting | | 2021-09-28

    Deequ (this is the Scala version but they have PyDeequ also)

  • GitHub repo Jupyter Scala

    A Scala kernel for Jupyter

    Project mention: EDA libraries for Scala and Spark? | | 2021-06-23

    What about … and …?

  • GitHub repo H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  • GitHub repo incubator-kyuubi

    Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark

    Project mention: Release Kyuubi-v1.1.0 | | 2021-03-12

  • GitHub repo frameless

    Expressive types for Spark.

    Project mention: Guide for Apache Spark Setup, Job Optimisation, AWS EMR Cluster Configuration, S3, YARN and HDFS Optimisation | | 2021-04-10

    For type safety with DataFrames, techniques like … can be used.

  • GitHub repo spark-daria

    Essential Spark extensions and helper methods ✨😲

    Project mention: Is Spark - The Definitive Guide outdated? | | 2021-07-01

    They spent a lot of effort improving the Catalyst engine under the hood too, making it easier to extend and improve in the future and easy to add your own native code to Spark itself. Shameless plug of a blog post I wrote on this subject, which basically reiterates what Matthew Powers, author of spark-daria and quinn, wrote here.

  • GitHub repo metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • GitHub repo sparkMeasure

    This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.

    Project mention: Spark Write Metrics | | 2021-07-01

    As an alternative to other proposed solutions, you could try and leverage the Spark metrics system to extract this information from accumulators. Metrics include total records and bytes written at each stage, among others. Take a look at SparkMeasure as well as an implementation example if you need to roll your own.

  • GitHub repo ScalNet

    A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs

  • GitHub repo spark-fast-tests

    Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

    Project mention: Show dataengineering: beavis, a library for unit testing Pandas/Dask code | | 2021-08-09

    I am the author of spark-fast-tests and chispa, libraries for unit testing Scala Spark / PySpark code.

  • GitHub repo spark-excel

    A Spark plugin for reading Excel files via Apache POI

    Project mention: How do I learn to read a plug-in? | | 2021-08-27

    The plug-in in question is crealytics/spark-excel (a Spark plugin for reading Excel files via Apache POI), but I guess it could be any. Assuming that I can read the plain code in an individual .scala file, how do I learn to understand how it all pieces together and what underlying code is being run?

  • GitHub repo delight

    A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

    Project mention: The New & Improved Spark UI & Spark History Server is now Generally Available | | 2021-05-07

    We encourage you to try it out! Sign up, follow the installation instructions on our GitHub page, and let us know your feedback over email (by replying to the welcome email) or using the live chat window in the product.

  • GitHub repo isolation-forest

    A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.

    Project mention: A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm. | | 2021-10-26

  • GitHub repo opaque-sql

    An encrypted data analytics platform

    Project mention: How to Run Spark SQL on Encrypted Data | | 2021-08-10

    Introducing Opaque SQL, an open-source platform for securely running Spark SQL queries on encrypted data. Built by top systems and security researchers at UC Berkeley, the platform uses hardware enclaves to securely execute queries on private data in an untrusted environment.

  • GitHub repo ZparkIO

    Boilerplate framework to use Spark and ZIO together.

    Project mention: Recommendations for specializing in Spark (Scala) | | 2020-12-22

  • GitHub repo spark-snowflake

    Snowflake Data Source for Apache Spark.

    Project mention: Why Databricks Is Winning | | 2021-02-14

    Snowflake and Databricks are different, sometimes complementary technologies. You can store data in Snowflake and query it with Databricks, for example.

    Snowflake predicate pushdown filtering seems quite promising.

    I think both of these companies can win.

  • GitHub repo Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  • GitHub repo Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-10-26.


What are some of the best open-source Spark projects in Scala? This list will help you:

 #  Project            Stars
 1  Apache Spark      31,120
 2  delta              3,718
 3  SynapseML          2,457
 4  spark-nlp          2,438
 5  Quill              1,947
 6  deequ              1,936
 7  Jupyter Scala      1,401
 8  H2O                  909
 9  incubator-kyuubi     781
10  frameless            763
11  spark-daria          618
12  metorikku            424
13  sparkMeasure         404
14  ScalNet              343
15  spark-fast-tests     297
16  spark-excel          253
17  delight              185
18  isolation-forest     158
19  opaque-sql           151
20  ZparkIO              142
21  spark-snowflake      123
22  Clustering4Ever      117
23  Schemer              106