Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  • GitHub repo Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: 5 Best Big Data Frameworks You Can Learn in 2021 | dev.to | 2021-06-18

    Both Fortune 500 and small companies are looking for competent people who can derive useful insight from their huge piles of data, and that's where Big Data frameworks like Apache Hadoop, Apache Spark, Flink, Storm, and Hive can help.

  • GitHub repo BigDL

    BigDL: Distributed Deep Learning Framework for Apache Spark

    Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05

    Intel BigDL, which again is for Spark.

  • GitHub repo delta

    An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. (by delta-io)

    Project mention: How to only read new data in set? | reddit.com/r/apachespark | 2021-05-20

    That's the exact use case for The Linux Foundation's Delta Lake project (https://delta.io/) and Structured Streaming.

  • GitHub repo mmlspark

    Microsoft Machine Learning for Apache Spark

    Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05

    Microsoft ML for Spark gets you a range of powerful ML features on Spark.

  • GitHub repo spark-nlp

    State of the Art Natural Language Processing

    Project mention: John Snow Labs Spark-NLP 3.1.0: Over 2600+ new models and pipelines in 200+ languages, new DistilBERT, RoBERTa, and XLM-RoBERTa transformers, support for external Transformers, and lots more! | reddit.com/r/java | 2021-06-08

  • GitHub repo Quill

    Compile-time Language Integrated Queries for Scala (by getquill)

    Project mention: Scala, 2.12/2.13, which driver/library recommend for connecting to Cassandra | reddit.com/r/scala | 2021-06-19

    https://github.com/getquill/quill is my choice. Works like a charm.

  • GitHub repo deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: High level overviews of how to properly publish Spark open source libraries (Scala and PySpark) | reddit.com/r/apachespark | 2021-04-15

    I am working with the Deequ maintainers and gave them some detailed suggestions on how to maintain a Scala open source lib. TL;DR:

  • GitHub repo Jupyter Scala

    A Scala kernel for Jupyter

    Project mention: Is there any editor or IDE that supports Ammonite with inline dependencies? | reddit.com/r/scala | 2021-03-10

    I use Almond in JupyterLab, which has pretty solid code completion. In IntelliJ, you can create a scratch sc file and run lines of it in the Scala REPL. That's really convenient for code completion and I normally will use that when I'm testing something from a specific project.

  • GitHub repo H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  • GitHub repo frameless

    Expressive types for Spark.

    Project mention: Guide for Apache Spark Setup, Job Optimisation, AWS EMR Cluster Configuration, S3, YARN and HDFS Optimisation | reddit.com/r/apachespark | 2021-04-10

    For type safety with dataframes, techniques like https://github.com/typelevel/frameless can be used.

  • GitHub repo spark-daria

    Essential Spark extensions and helper methods ✨😲

    Project mention: Ask HN: What are some tools / libraries you built yourself? | news.ycombinator.com | 2021-05-16

    I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.

    quinn (https://github.com/MrPowers/quinn) and chispa (https://github.com/MrPowers/chispa) are the PySpark equivalents.

    Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.

    Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8

  • GitHub repo metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • GitHub repo ScalNet

    A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs

  • GitHub repo spark-fast-tests

    Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

    Project mention: Ask HN: What are some tools / libraries you built yourself? | news.ycombinator.com | 2021-05-16

    I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.

    quinn (https://github.com/MrPowers/quinn) and chispa (https://github.com/MrPowers/chispa) are the PySpark equivalents.

    Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.

    Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8

  • GitHub repo delight

    A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

    Project mention: The New & Improved Spark UI & Spark History Server is now Generally Available | dev.to | 2021-05-07

    We encourage you to try it out! Sign up, follow the installation instructions on our GitHub page, and let us know your feedback over email (by replying to the welcome email) or using the live chat window in the product.

  • GitHub repo opaque-sql

    An encrypted data analytics platform

    Project mention: Announcing MC²: Securely perform analytics and machine learning on confidential data | dev.to | 2021-06-17

    The MC2 Compute Services: MC2 offers several compute services: these include Spark SQL, distributed XGBoost, and secure aggregation for federated learning. All are intended to run in a primarily untrusted environment, such as a cluster of machines hosted on a public cloud, that has support for trusted execution environments (hardware enclaves). Data is encrypted in transit using a client key and only ever decrypted inside hardware enclaves, providing the previously mentioned security guarantees for data-in-use. For all compute services, MC2 leverages the Open Enclave SDK, a project intended to provide a consistent API for a variety of different enclave architectures.

  • GitHub repo ZparkIO

    Boilerplate framework to use Spark and ZIO together.

    Project mention: Recommendations for specializing in Spark (Scala) | reddit.com/r/scala | 2020-12-22

  • GitHub repo spark-snowflake

    Snowflake Data Source for Apache Spark.

    Project mention: Why Databricks Is Winning | news.ycombinator.com | 2021-02-14

    Snowflake and Databricks are different, sometimes complementary technologies. You can store data in Snowflake & query it with Databricks for example: https://github.com/snowflakedb/spark-snowflake

    Snowflake predicate pushdown filtering seems quite promising: https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...

    Think both these companies can win.

  • GitHub repo Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  • GitHub repo Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  • GitHub repo ammonite-spark

    Run Spark calculations from Ammonite

    Project mention: Learning Spark Scala: I'm a medium Python Data Engineer with some experience in Java. I have to learn "enough" Scala to be at ease with Spark's Scala API. I have three weeks. Where should I start ? | reddit.com/r/scala | 2021-02-03

    https://github.com/alexarchambault/ammonite-spark made the experience more pleasant for me.

  • GitHub repo cobrix

    A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

  • GitHub repo osm4scala

    Scala and Spark library focused on reading OpenStreetMap Pbf files.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-06-19.

Index

What are some of the best open-source Spark projects in Scala? This list will help you:

#   Project           Stars
1   Apache Spark      30,129
2   BigDL             3,737
3   delta             3,406
4   mmlspark          2,350
5   spark-nlp         2,197
6   Quill             1,877
7   deequ             1,721
8   Jupyter Scala     1,375
9   H2O               896
10  frameless         734
11  spark-daria       589
12  metorikku         384
13  ScalNet           344
14  spark-fast-tests  282
15  delight           142
16  opaque-sql        138
17  ZparkIO           134
18  spark-snowflake   116
19  Clustering4Ever   115
20  Schemer           97
21  ammonite-spark    95
22  cobrix            86
23  osm4scala         48