Top 23 Scala Big Data Projects
-
Project mention: Introducing RisingWave's Hosted Iceberg Catalog-No External Setup Needed | dev.to | 2025-07-04
Because the hosted catalog is a standard JDBC catalog, tools like Spark, Trino, and Flink can still access your tables. For example:
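As a minimal sketch of what that access could look like from Spark, the snippet below assembles the catalog properties that Apache Iceberg documents for a JDBC-backed catalog. The catalog name, JDBC URI, credentials, and warehouse path are hypothetical placeholders, not values from the RisingWave post; only the property keys and class names come from Iceberg's own conventions.

```python
def iceberg_jdbc_catalog_conf(name, jdbc_uri, user, password, warehouse):
    """Build Spark properties that register an Iceberg catalog backed by JDBC.

    All argument values are caller-supplied placeholders (assumptions).
    """
    prefix = f"spark.sql.catalog.{name}"
    return {
        # Register the catalog under `name` using Iceberg's Spark integration.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Select Iceberg's JDBC catalog implementation.
        f"{prefix}.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
        # Connection details for the database that stores catalog metadata.
        f"{prefix}.uri": jdbc_uri,
        f"{prefix}.jdbc.user": user,
        f"{prefix}.jdbc.password": password,
        # Root location of the table data files.
        f"{prefix}.warehouse": warehouse,
    }


def as_spark_flags(conf):
    """Render the properties as `--conf key=value` arguments for spark-sql or spark-submit."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    conf = iceberg_jdbc_catalog_conf(
        name="hosted",
        jdbc_uri="jdbc:postgresql://localhost:5432/catalog",
        user="user",
        password="secret",
        warehouse="s3://example-bucket/warehouse",
    )
    for flag in as_spark_flags(conf):
        print(flag)
```

The same key/value pairs could instead be passed to `SparkSession.builder.config(...)` in a Scala or Python Spark job; the Iceberg Spark runtime and a matching JDBC driver would also need to be on the classpath.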
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
Project mention: Twitter's 600-Tweet Daily Limit Crisis: Soaring GCP Costs and the Open Source Fix Elon Musk Ignored | dev.to | 2025-04-10
Delta Lake: Delta Lake is an open-source storage layer that provides ACID transactions, scalable metadata management, and data versioning on top of existing data lakes. It aims to bring reliability and performance optimizations to big data workloads while ensuring data integrity and consistency.
-
> to see how they ended up in that situation
The "how" is almost always lack of discipline (or as I sometimes couch it, "imagination") but usually shit like https://github.com/microsoft/SynapseML/issues/405#:~:text=cl...
-
If you're looking for more modern notebooks supporting Scala (and Spark):
- https://almond.sh
- https://polynote.org
Toree is mostly dead but might also get a Scala 2.13 release now that Spark 4.0 is approaching.
-
Reactive-kafka
Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
-
adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
-
Project mention: Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL | news.ycombinator.com | 2025-05-12
-
nussknacker
Low-code tool for automating actions on real time data | Stream processing for the users.
-
qbeast-spark
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
-
Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
-
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Scala Big Data discussion
Scala Big Data related posts
-
Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub
-
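Engenharia de Dados com Scala: masterizando o processamento de dados em tempo real com Apache Flink e Google Pub/Sub (Portuguese-language version of "Data Engineering with Scala: Mastering Real-Time Data Processing with Apache Flink and Google Pub/Sub")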
-
Make Rust Object Oriented with the dual-trait pattern
-
Ask AN: Anyone using Delta Sharing in production?
-
Azure data lake - Data Share
-
The "Big Three's" Data Storage Offerings
-
Medallion/lakehouse architecture data modelling
-
Index
What are some of the best open-source Big Data projects in Scala? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Spark | 41,441 |
2 | kafka-manager | 11,894 |
3 | delta | 8,144 |
4 | SynapseML | 5,146 |
5 | Scalding | 3,515 |
6 | Scio | 2,606 |
7 | Jupyter Scala | 1,618 |
8 | Reactive-kafka | 1,415 |
9 | adam | 1,027 |
10 | H2O | 973 |
11 | BIDMach | 914 |
12 | spark-rapids | 909 |
13 | delta-sharing | 850 |
14 | Gearpump | 762 |
15 | Vegas | 728 |
16 | nussknacker | 692 |
17 | metorikku | 585 |
18 | Sparkta | 526 |
19 | Scoobi | 482 |
20 | qbeast-spark | 231 |
21 | Clustering4Ever | 130 |
22 | Schemer | 112 |
23 | spark-deployer | 76 |