Top 23 Scala Big Data Projects
Apache Spark - https://spark.apache.org/
Project mention: Are there any openly available data engineering projects using Scala and Spark which follow industry conventions like proper folder/package structures and object oriented division of classes/concerns? Most examples I’ve seen have everything in one file without proper separation of concerns. | /r/dataengineering | 2023-01-24
Reactive-kafka
Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24
FYI: ADAM seems to do that
nussknacker
Low-code tool for automating actions on real time data | Stream processing for the users.
qbeast-spark
Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!
Clustering4Ever
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Schemer
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.
Spark Utils
Basic framework utilities to quickly start writing production ready Apache Spark applications
Scala Big Data related posts
- Azure data lake - Data Share
- The "Big Three's" Data Storage Offerings
- Medallion/lakehouse architecture data modelling
- How to build a data pipeline using Delta Lake
- whenNotMatchedBySourceUpdate not existing? Trying to upsert parquet into Delta table
- Delta.io/deltalake self hosting
Index
What are some of the best open-source Big Data projects in Scala? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | Apache Spark | 37,245 |
2 | kafka-manager | 11,566 |
3 | SynapseML | 4,867 |
4 | Scalding | 3,457 |
5 | Scio | 2,495 |
6 | Jupyter Scala | 1,554 |
7 | Reactive-kafka | 1,415 |
8 | adam | 960 |
9 | BIDMach | 912 |
10 | Gearpump | 766 |
11 | Vegas | 731 |
12 | delta-sharing | 647 |
13 | spark-rapids | 592 |
14 | metorikku | 568 |
15 | nussknacker | 532 |
16 | Sparkta | 527 |
17 | Scoobi | 482 |
18 | qbeast-spark | 167 |
19 | Clustering4Ever | 127 |
20 | Schemer | 110 |
21 | Scoozie | 82 |
22 | spark-deployer | 75 |
23 | Spark Utils | 35 |