Scala Spark

Open-source Scala projects categorized as Spark

Top 23 Scala Spark Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Integrate Pyspark Structured Streaming with confluent-kafka | dev.to | 2023-08-12

    Apache Spark - https://spark.apache.org/

  • SynapseML

    Simple and Distributed Machine Learning

    Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12

  • spark-nlp

    State of the Art Natural Language Processing

    Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06

  • deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: [Data Quality] Deequ Feedback request | /r/dataengineering | 2023-03-01

    There's no straightforward way to drop and rerun a metric collection. For example, say you detect a problem in your data. You fix it, rerun the pipeline, and replace the bad data with the good. You'd want your metrics history to reflect the true state of your data. But the "bad run" cannot be dropped.

  • Quill

    Compile-time Language Integrated Queries for Scala

    Project mention: Dear Sir, You Have Built a Compiler (2022) | news.ycombinator.com | 2023-08-17

    https://github.com/zio/zio-quill

    This library does exactly what you prescribe. Pretty sure under the hood it's using macros with string templates

  • spark-cassandra-connector

    DataStax Connector for Apache Spark to Apache Cassandra (by datastax)

  • kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

  • Jupyter Scala

    A Scala kernel for Jupyter

  • mleap

    MLeap: Deploy ML Pipelines to Production

    Project mention: Machine Learning Pipelines with Spark: Introductory Guide (Part 1) | dev.to | 2022-10-23

    Everything is custom and will take a lot of work, but luckily, you don’t have to do all the work here. In the second article, you will use MLeap, a library that does the heavy lifting in terms of serializing Spark ML Pipeline for real-time inference and also provides an execution engine for Spark so you can deploy pipelines on non-Spark runtimes.

  • LearningSparkV2

    This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

    Project mention: datadelivery: Providing public datasets to explore in AWS | dev.to | 2023-04-08

    Learning Spark

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

    Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24

    FYI: ADAM seems to do that

  • H2O

    Sparkling Water provides H2O functionality inside Spark cluster

  • tispark

    TiSpark is built for running Apache Spark on top of TiDB/TiKV

  • frameless

    Expressive types for Spark.

    Project mention: for comprehension and some questions | /r/scala | 2023-01-22

    I don't see how Spark is any "less controversial" when the Spark Delay instance for cats-effect takes an entire SparkSession implicitly.

  • incubator-livy

    Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

    Project mention: Sparkless is born | /r/apachespark | 2022-11-24

    Apache Livy - REST API For submitting jobs to a cluster. Used in conjunction with Jupyter or Zeppelin notebooks and you have a multi-tenant SQL only workload — https://livy.apache.org

  • spark-daria

    Essential Spark extensions and helper methods ✨😲

    Project mention: Lakehouse architecture in Azure Synapse without Databricks? | /r/dataengineering | 2023-04-13

    I was a Databricks user for 5 years and spent 95% of my time developing Spark code in IDEs. See the spark-daria and spark-fast-tests projects as Scala examples. I developed internal libraries with all the business logic. The Databricks notebooks would consist of a few lines of code that would invoke a function in the proprietary Spark codebase. The proprietary Spark codebase would depend on the OSS libraries I developed in parallel.

  • delta-sharing

    An open protocol for secure data sharing

    Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29

  • sparkMeasure

    This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • spline

    Data Lineage Tracking And Visualization Solution (by AbsaOSS)

    Project mention: Show HN: First open source data discovery and observability platform | news.ycombinator.com | 2022-10-22

    We found a way by leveraging the Spline Agent (https://github.com/AbsaOSS/spline) to make use of the Execution Plans, transform them into a suiting data model for our set of requirements and developed a UI to explore these relationships. We also open-sourced our approach in a

  • spark-solr

    Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.

    Project mention: How to store 175 million rows and query them | /r/datasets | 2023-05-10

  • spark-fast-tests

    Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

    Project mention: Lakehouse architecture in Azure Synapse without Databricks? | /r/dataengineering | 2023-04-13

    I was a Databricks user for 5 years and spent 95% of my time developing Spark code in IDEs. See the spark-daria and spark-fast-tests projects as Scala examples. I developed internal libraries with all the business logic. The Databricks notebooks would consist of a few lines of code that would invoke a function in the proprietary Spark codebase. The proprietary Spark codebase would depend on the OSS libraries I developed in parallel.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-12.

Index

What are some of the best open-source Spark projects in Scala? This list will help you:

Rank  Project                    Stars
1     Apache Spark               36,785
2     SynapseML                  4,513
3     spark-nlp                  3,429
4     deequ                      2,947
5     Quill                      2,134
6     spark-cassandra-connector  1,916
7     kyuubi                     1,713
8     Jupyter Scala              1,544
9     mleap                      1,476
10    LearningSparkV2            997
11    adam                       958
12    H2O                        954
13    tispark                    861
14    frameless                  857
15    incubator-livy             818
16    spark-daria                730
17    delta-sharing              638
18    sparkMeasure               594
19    spark-rapids               571
20    metorikku                  566
21    spline                     538
22    spark-solr                 439
23    spark-fast-tests           399