Top 23 Scala Big Data Projects

Apache Spark

101 38,320 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11

kafka-manager

13 11,670 0.0 Scala

CMAK (Cluster Manager for Apache Kafka, previously known as Kafka Manager) is a tool for managing Apache Kafka clusters

Project mention: FLaNK Stack Weekly 16 October 2023 | dev.to | 2023-10-17

delta

69 6,897 9.8 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.
I think the website is here: https://delta.io

SynapseML

18 4,967 8.9 Scala

Simple and Distributed Machine Learning

Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12

Scalding

0 3,470 2.5 Scala

A Scala API for Cascading
Scio

7 2,520 9.6 Scala

A Scala API for Apache Beam and Google Cloud Dataflow.
Jupyter Scala

6 1,562 9.0 Scala

A Scala kernel for Jupyter
Reactive-kafka

0 1,418 8.2 Scala

Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
adam

3 967 6.1 Scala

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
H2O

0 952 7.7 Scala

Sparkling Water provides H2O functionality inside a Spark cluster
BIDMach

0 913 0.0 Scala

CPU and GPU-accelerated Machine Learning Library
Gearpump

0 765 0.0 Scala

Lightweight real-time big data streaming engine over Akka
Vegas

0 729 0.0 Scala

The missing MatPlotLib for Scala + Spark (by vegas-viz)
spark-rapids

3 720 9.8 Scala

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
delta-sharing

4 674 7.9 Scala

An open protocol for secure data sharing

Project mention: Azure data lake - Data Share | /r/dataengineering | 2023-06-29

nussknacker

1 609 9.8 Scala

Low-code tool for automating actions on real time data | Stream processing for the users.
metorikku

0 576 2.4 Scala

A simplified, lightweight ETL Framework based on Apache Spark
Sparkta

0 524 0.0 Scala

Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)
Scoobi

0 482 0.0 Scala

A Scala productivity framework for Hadoop. (by NICTA)
qbeast-spark

12 190 8.6 Scala

Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Clustering4Ever

0 128 0.0 Scala

C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Schemer

0 112 0.0 Scala

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Scoozie

0 82 0.0 Scala

Scala DSL on top of Oozie XML

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scala Big Data related posts

Azure data lake - Data Share
1 project | /r/dataengineering | 29 Jun 2023
The "Big Three's" Data Storage Offerings
2 projects | /r/dataengineering | 15 Jun 2023
Medallion/lakehouse architecture data modelling
1 project | /r/dataengineering | 3 Jun 2023
How to build a data pipeline using Delta Lake
2 projects | dev.to | 19 May 2023
whenNotMatchedBySourceUpdate not existing? Trying to upsert parquet into Delta table
1 project | /r/apachespark | 10 May 2023
Delta.io/deltalake self hosting
2 projects | /r/bigdata | 26 Apr 2023
Delta.io/deltalake self hosting
1 project | /r/DeltaLake | 25 Apr 2023

Index

What are some of the best open-source Big Data projects in Scala? This list will help you:

#	Project	Stars
1	Apache Spark	38,320
2	kafka-manager	11,670
3	delta	6,897
4	SynapseML	4,967
5	Scalding	3,470
6	Scio	2,520
7	Jupyter Scala	1,562
8	Reactive-kafka	1,418
9	adam	967
10	H2O	952
11	BIDMach	913
12	Gearpump	765
13	Vegas	729
14	spark-rapids	720
15	delta-sharing	674
16	nussknacker	609
17	metorikku	576
18	Sparkta	524
19	Scoobi	482
20	qbeast-spark	190
21	Clustering4Ever	128
22	Schemer	112
23	Scoozie	82