Scala Big Data

Open-source Scala projects categorized as Big Data

Top 23 Scala Big Data Projects

  1. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Introducing RisingWave's Hosted Iceberg Catalog: No External Setup Needed | dev.to | 2025-07-04

    Because the hosted catalog is a standard JDBC catalog, tools like Spark, Trino, and Flink can still access your tables. For example:
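The quoted example is cut off on this page. As a rough sketch of what pointing Spark at an Iceberg JDBC catalog usually involves, the helper below builds the relevant Spark properties. The property keys follow Apache Iceberg's Spark catalog configuration; the object name `IcebergJdbcCatalogConf`, the catalog name, and every connection value are hypothetical placeholders, not details taken from the RisingWave post.

```scala
// Sketch only: builds the Spark properties that register an Iceberg catalog
// backed by JDBC. Callers supply their own values; nothing here is a real
// RisingWave endpoint or credential.
object IcebergJdbcCatalogConf {
  def settings(
      catalogName: String, // name used in SQL, e.g. SELECT * FROM <catalogName>.db.tbl
      jdbcUri: String,     // JDBC URL of the database holding the catalog metadata
      user: String,
      password: String,
      warehouse: String    // root location of the table data files
  ): Map[String, String] = {
    val prefix = s"spark.sql.catalog.$catalogName"
    Map(
      prefix                  -> "org.apache.iceberg.spark.SparkCatalog",
      s"$prefix.catalog-impl" -> "org.apache.iceberg.jdbc.JdbcCatalog",
      s"$prefix.uri"          -> jdbcUri,
      s"$prefix.jdbc.user"    -> user,
      s"$prefix.jdbc.password" -> password,
      s"$prefix.warehouse"    -> warehouse
    )
  }
}
```

With Spark and the Iceberg Spark runtime on the classpath, each pair can be applied through `SparkSession.builder().config(key, value)` before calling `getOrCreate()`. Keeping the settings in a plain `Map` makes them easy to inspect or reuse in `spark-submit --conf` arguments.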

  3. kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

  4. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: Twitter's 600-Tweet Daily Limit Crisis: Soaring GCP Costs and the Open Source Fix Elon Musk Ignored | dev.to | 2025-04-10

    Delta Lake: Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata management, and data versioning on top of existing data lakes. It aims to bring reliability and performance optimizations to big data workloads while ensuring data integrity and consistency.

  5. SynapseML

    Simple and Distributed Machine Learning

    Project mention: The Grug Brained Developer | news.ycombinator.com | 2025-06-17

    > to see how they ended up in that situation

    The "how" is almost always lack of discipline (or as I sometimes couch it, "imagination") but usually shit like https://github.com/microsoft/SynapseML/issues/405#:~:text=cl...

  6. Scalding

    A Scala API for Cascading

  7. Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

  8. Jupyter Scala

    A Scala kernel for Jupyter

    Project mention: Apache Zeppelin | news.ycombinator.com | 2024-09-02

    If you're looking for more modern notebooks supporting Scala (and Spark):

    - https://almond.sh

    - https://polynote.org

    Toree is mostly dead but might also get a Scala 2.13 release now that Spark 4.0 is approaching.

  10. Reactive-kafka

    Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.

  11. adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

  12. H2O

    Sparkling Water provides H2O functionality inside a Spark cluster

  13. BIDMach

    CPU and GPU-accelerated Machine Learning Library

  14. spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

    Project mention: Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL | news.ycombinator.com | 2025-05-12

  15. delta-sharing

    An open protocol for secure data sharing

  16. Gearpump

    Lightweight real-time big data streaming engine over Akka

  17. Vegas

    The missing MatPlotLib for Scala + Spark (by vegas-viz)

  18. nussknacker

    Low-code tool for automating actions on real-time data | Stream processing for the users.

  19. metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  20. Sparkta

    Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)

  21. Scoobi

    A Scala productivity framework for Hadoop. (by NICTA)

  22. qbeast-spark

    Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

  23. Clustering4Ever

    C4E, a JVM-friendly library written in Scala for both local and distributed (Spark) clustering.

  24. Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  25. spark-deployer

    Deploy a Spark cluster in an easy way.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Big Data related posts

  • Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub

    3 projects | dev.to | 17 Oct 2024
  • Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub

    3 projects | dev.to | 8 Aug 2024
  • Make Rust Object Oriented with the dual-trait pattern

    2 projects | dev.to | 8 Jul 2024
  • Ask HN: Anyone using Delta Sharing in production?

    1 project | news.ycombinator.com | 1 Jul 2024
  • Azure data lake - Data Share

    1 project | /r/dataengineering | 29 Jun 2023
  • The "Big Three's" Data Storage Offerings

    2 projects | /r/dataengineering | 15 Jun 2023
  • Medallion/lakehouse architecture data modelling

    1 project | /r/dataengineering | 3 Jun 2023

Index

What are some of the best open-source Big Data projects in Scala? This list will help you:

# Project Stars
1 Apache Spark 41,441
2 kafka-manager 11,894
3 delta 8,144
4 SynapseML 5,146
5 Scalding 3,515
6 Scio 2,606
7 Jupyter Scala 1,618
8 Reactive-kafka 1,415
9 adam 1,027
10 H2O 973
11 BIDMach 914
12 spark-rapids 909
13 delta-sharing 850
14 Gearpump 762
15 Vegas 728
16 nussknacker 692
17 metorikku 585
18 Sparkta 526
19 Scoobi 482
20 qbeast-spark 231
21 Clustering4Ever 130
22 Schemer 112
23 spark-deployer 76

Did you know that Scala is the 32nd most popular programming language based on number of references?