Scala Big Data

Open-source Scala projects categorized as Big Data

Top 23 Scala Big Data Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Integrate Pyspark Structured Streaming with confluent-kafka | 2023-08-12

  • kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

    Project mention: FLaNK Stack Weekly 16 October 2023 | 2023-10-17
  • SynapseML

    Simple and Distributed Machine Learning

    Project mention: FLaNK Stack Weekly for 12 September 2023 | 2023-09-12
  • Scalding

    A Scala API for Cascading

  • Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Project mention: Are there any openly available data engineering projects using Scala and Spark which follow industry conventions like proper folder/package structures and object oriented division of classes/concerns? Most examples I’ve seen have everything in one file without proper separation of concerns. | /r/dataengineering | 2023-01-24
  • Jupyter Scala

    A Scala kernel for Jupyter

  • Reactive-kafka

    Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

    Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24

    FYI: ADAM seems to do that

  • BIDMach

    CPU and GPU-accelerated Machine Learning Library

  • Gearpump

    Lightweight real-time big data streaming engine over Akka

  • Vegas

    The missing MatPlotLib for Scala + Spark (by vegas-viz)

  • delta-sharing

    An open protocol for secure data sharing

    Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29
  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • nussknacker

    Low-code tool for automating actions on real time data | Stream processing for the users.

  • Sparkta

    Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)

  • Scoobi

    A Scala productivity framework for Hadoop. (by NICTA)

  • qbeast-spark

    Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

    Project mention: Release 0.3.2 of qbeast-spark! | /r/apachespark | 2023-03-14
  • Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  • Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  • Scoozie

    Scala DSL on top of Oozie XML

  • spark-deployer

    Deploy Spark cluster in an easy way.

  • Spark Utils

    Basic framework utilities to quickly start writing production ready Apache Spark applications

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-10-17.

What are some of the best open-source Big Data projects in Scala? This list will help you:

Rank  Project          Stars
1     Apache Spark     37,245
2     kafka-manager    11,566
3     SynapseML        4,867
4     Scalding         3,457
5     Scio             2,495
6     Jupyter Scala    1,554
7     Reactive-kafka   1,415
8     adam             960
9     BIDMach          912
10    Gearpump         766
11    Vegas            731
12    delta-sharing    647
13    spark-rapids     592
14    metorikku        568
15    nussknacker      532
16    Sparkta          527
17    Scoobi           482
18    qbeast-spark     167
19    Clustering4Ever  127
20    Schemer          110
21    Scoozie          82
22    spark-deployer   75
23    Spark Utils      35