Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: How I've implemented the Medallion architecture using Apache Spark and Apache Hadoop | dev.to | 2024-06-17

    In this project, I'm exploring the Medallion Architecture, a data design pattern that organizes data into layers based on structure and/or quality. I've created a fictional scenario in which a large enterprise has several branches across the country. Each branch receives purchase orders from an app and delivers the goods to its customers. The enterprise wants to identify the branch that receives the most purchase requests and the branch that has the lowest average delivery time. To achieve that, I've used Apache Spark as a distributed compute engine and Apache Hadoop, in particular HDFS, as my data storage layer. Apache Spark ingests, processes, and stores the app's data on HDFS to be served to a custom dashboard app. You can find all about it in this GitHub repo.

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

    I think the website is here: https://delta.io

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • spark-nlp

    State of the Art Natural Language Processing

  • Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
  • deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

  • Quill

    Compile-time Language Integrated Queries for Scala

  • Project mention: Dear Sir, You Have Built a Compiler (2022) | news.ycombinator.com | 2023-08-17

    https://github.com/zio/zio-quill

    This library does exactly what you prescribe. Pretty sure under the hood it's using macros with string templates

  • kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  • spark-cassandra-connector

    DataStax Connector for Apache Spark to Apache Cassandra (by datastax)

  • Jupyter Scala

    A Scala kernel for Jupyter

  • mleap

    MLeap: Deploy ML Pipelines to Production

  • LearningSparkV2

    This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  • H2O

    Sparkling Water provides H2O functionality inside Spark cluster

  • tispark

    TiSpark is built for running Apache Spark on top of TiDB/TiKV

  • frameless

    Expressive types for Spark.

  • incubator-livy

    Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • spark-daria

    Essential Spark extensions and helper methods ✨😲

  • delta-sharing

    An open protocol for secure data sharing

  • Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29
  • sparkMeasure

    This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

  • spline

    Data Lineage Tracking And Visualization Solution (by AbsaOSS)

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • spark-excel

    A Spark plugin for reading and writing Excel files

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Spark related posts

  • Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!

    1 project | /r/Python | 6 Sep 2023
  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023
  • Pandas was faster and less memory intensive than crealytics pyspark. How is it possible?

    2 projects | /r/dataengineering | 17 Jun 2023
  • The "Big Three's" Data Storage Offerings

    2 projects | /r/dataengineering | 15 Jun 2023
  • Medallion/lakehouse architecture data modelling

    1 project | /r/dataengineering | 3 Jun 2023
  • How to build a data pipeline using Delta Lake

    2 projects | dev.to | 19 May 2023
  • PySpark for NLP Workshop - Materials and Jupyter Notebooks

    2 projects | /r/dataengineering | 14 May 2023

Index

What are some of the best open-source Spark projects in Scala? This list will help you:

 #  Project                    Stars
 1  Apache Spark               38,762
 2  delta                       7,197
 3  SynapseML                   4,999
 4  spark-nlp                   3,745
 5  deequ                       3,158
 6  Quill                       2,140
 7  kyuubi                      2,000
 8  spark-cassandra-connector   1,933
 9  Jupyter Scala               1,573
10  mleap                       1,499
11  LearningSparkV2             1,095
12  adam                          966
13  H2O                           953
14  tispark                       880
15  frameless                     869
16  incubator-livy                863
17  spark-rapids                  754
18  spark-daria                   742
19  delta-sharing                 711
20  sparkMeasure                  668
21  spline                        586
22  metorikku                     577
23  spark-excel                   450
