Top 23 Scala Spark Projects
-
Apache Spark - https://spark.apache.org/
-
Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
-
deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
There's no straightforward way to drop and rerun a metric collection. For example, say you detect a problem in your data. You fix it, rerun the pipeline, and replace the bad data with the good. You'd want your metrics history to reflect the true state of your data. But the "bad run" cannot be dropped.
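To make the "unit tests for data" idea from the deequ description concrete, here is a minimal sketch in plain Scala. It deliberately does not use deequ's API or Spark: the `Order` record, the `Check` type, and the `DataChecks` object are hypothetical names invented for illustration, and the constraints run over an in-memory `Seq` rather than a distributed dataset.

```scala
// Hypothetical sketch: plain Scala collections, NOT deequ's actual API.
final case class Order(id: Long, amount: Double, country: Option[String])

// One named constraint over the whole dataset, i.e. one "unit test for data".
final case class Check(name: String, holds: Seq[Order] => Boolean)

object DataChecks {
  val checks: Seq[Check] = Seq(
    Check("dataset is not empty", _.nonEmpty),
    Check("id is unique", rows => rows.map(_.id).distinct.size == rows.size),
    Check("amount is non-negative", _.forall(_.amount >= 0.0)),
    Check(
      "country is at least 50% complete",
      rows => rows.nonEmpty && rows.count(_.country.isDefined).toDouble / rows.size >= 0.5
    )
  )

  // Returns the names of the checks that failed, in declaration order.
  def failed(rows: Seq[Order]): Seq[String] =
    checks.filterNot(_.holds(rows)).map(_.name)

  // Small sample datasets used only for demonstration.
  val sampleGood: Seq[Order] = Seq(Order(1L, 10.0, Some("DE")), Order(2L, 5.5, None))
  val sampleBad: Seq[Order]  = Seq(Order(1L, 10.0, None), Order(1L, -3.0, None))

  def main(args: Array[String]): Unit = {
    println(failed(sampleGood)) // no failures expected for the clean sample
    println(failed(sampleBad))  // duplicate id, negative amount, missing countries
  }
}
```

A pipeline could call `DataChecks.failed` on each batch and stop when the result is non-empty. A library such as deequ applies the same idea to Spark DataFrames and can also persist the computed metrics, which is where the rerun problem described above comes in.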
-
https://github.com/zio/zio-quill
This library does exactly what you prescribe. Pretty sure under the hood it's using macros with string templates
-
kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
-
Project mention: Machine Learning Pipelines with Spark: Introductory Guide (Part 1) | dev.to | 2022-10-23
Everything is custom and will take a lot of work, but luckily, you don’t have to do all the work here. In the second article, you will use MLeap, a library that does the heavy lifting of serializing Spark ML pipelines for real-time inference and also provides an execution engine for Spark so you can deploy pipelines on non-Spark runtimes.
-
LearningSparkV2
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition].
-
adam
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Project mention: biobear -- python package with minimal dependencies for bioinformatic file parsing and querying using rust and polars as the backend | /r/bioinformatics | 2023-04-24
FYI: ADAM seems to do that
-
I don't see how Spark is any "less controversial" when the Spark Delay instance for cats-effect takes an entire SparkSession implicitly.
-
incubator-livy
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Apache Livy: a REST API for submitting jobs to a cluster. Use it in conjunction with Jupyter or Zeppelin notebooks and you have a multi-tenant, SQL-only workload: https://livy.apache.org
-
Project mention: Lakehouse architecture in Azure Synapse without Databricks? | /r/dataengineering | 2023-04-13
I was a Databricks user for 5 years and spent 95% of my time developing Spark code in IDEs. See the spark-daria and spark-fast-tests projects as Scala examples. I developed internal libraries with all the business logic. The Databricks notebooks would consist of a few lines of code that would invoke a function in the proprietary Spark codebase. The proprietary Spark codebase would depend on the OSS libraries I developed in parallel.
-
sparkMeasure
This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
-
Project mention: Show HN: First open source data discovery and observability platform | news.ycombinator.com | 2022-10-22
We found a way to leverage the Spline Agent (https://github.com/AbsaOSS/spline) to make use of the execution plans and transform them into a data model suited to our requirements, and we developed a UI to explore these relationships. We also open-sourced our approach in a
-
spark-solr
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
-
spark-fast-tests
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Project mention: Lakehouse architecture in Azure Synapse without Databricks? | /r/dataengineering | 2023-04-13
I was a Databricks user for 5 years and spent 95% of my time developing Spark code in IDEs. See the spark-daria and spark-fast-tests projects as Scala examples. I developed internal libraries with all the business logic. The Databricks notebooks would consist of a few lines of code that would invoke a function in the proprietary Spark codebase. The proprietary Spark codebase would depend on the OSS libraries I developed in parallel.
Scala Spark related posts
- Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more!
- Azure data lake - Data Share
- Pandas was faster and less memory intensive then crealytics pyspark. How is it possible?
- The "Big Three's" Data Storage Offerings
- Medallion/lakehouse architecture data modelling
- How to build a data pipeline using Delta Lake
- PySpark for NLP Workshop - Materials and Jupyter Notebooks
Index
What are some of the best open-source Spark projects in Scala? This list will help you:
# | Project | Stars |
---|---|---|
1 | Apache Spark | 36,785 |
2 | SynapseML | 4,513 |
3 | spark-nlp | 3,429 |
4 | deequ | 2,947 |
5 | Quill | 2,134 |
6 | spark-cassandra-connector | 1,916 |
7 | kyuubi | 1,713 |
8 | Jupyter Scala | 1,544 |
9 | mleap | 1,476 |
10 | LearningSparkV2 | 997 |
11 | adam | 958 |
12 | H2O | 954 |
13 | tispark | 861 |
14 | frameless | 857 |
15 | incubator-livy | 818 |
16 | spark-daria | 730 |
17 | delta-sharing | 638 |
18 | sparkMeasure | 594 |
19 | spark-rapids | 571 |
20 | metorikku | 566 |
21 | spline | 538 |
22 | spark-solr | 439 |
23 | spark-fast-tests | 399 |