Top 23 Scala Spark Projects
Apache Spark - A unified analytics engine for large-scale data processing
Project mention: What is B2D Sector? | dev.to | 2021-10-17
Example tools: Tensorflow, Tableau, Apache Spark, Matlab, Jupyter
An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. (by delta-io)
Project mention: SCD type 2 in spark | reddit.com/r/dataengineering | 2021-10-15
Use Hudi Or Delta Lake
Microsoft Machine Learning for Apache Spark
Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05
Microsoft ML for Spark gets you a range of powerful ML features on Spark.
State of the Art Natural Language Processing
Project mention: November 2021 workshops -- please comment about your preferences | reddit.com/r/Clojure | 2021-10-15
Compile-time Language Integrated Queries for Scala (by getquill)
Project mention: Scala, 2.12/2.13, which driver/library recommend for connecting to Cassandra | reddit.com/r/scala | 2021-06-19
https://github.com/getquill/quill is my choice. Works like a charm.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Project mention: PySpark - How to get Corrupted Records after Casting | reddit.com/r/dataengineering | 2021-09-28
Deequ (this is the Scala version but they have PyDeequ also)
A Scala kernel for Jupyter
Project mention: EDA libraries for Scala and Spark? | reddit.com/r/scala | 2021-06-23
What about https://github.com/alexarchambault/plotly-scala and https://almond.sh/
Sparkling Water provides H2O functionality inside a Spark cluster
Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark
Project mention: Release Kyuubi-v1.1.0 | reddit.com/r/apachespark | 2021-03-12
Expressive types for Spark.
Project mention: Guide for Apache Spark Setup, Job Optimisation, AWS EMR Cluster Configuration, S3, YARN and HDFS Optimisation | reddit.com/r/apachespark | 2021-04-10
For type safety with dataframes, techniques like https://github.com/typelevel/frameless can be used.
Essential Spark extensions and helper methods ✨😲
Project mention: Is Spark - The Definitive Guide outdated? | reddit.com/r/apachespark | 2021-07-01
They also put a lot of effort into improving the Catalyst engine under the hood, making it easier to extend and improve in the future and easier to add your own native code to Spark itself. Shameless plug of a blog post I wrote on this subject, which basically reiterates what Matthew Powers, author of Spark Daria and quinn, wrote here.
A simplified, lightweight ETL Framework based on Apache Spark
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
Project mention: Spark Write Metrics | reddit.com/r/dataengineering | 2021-07-01
As an alternative to other proposed solutions, you could try and leverage the Spark metrics system to extract this information from accumulators. Metrics include total records and bytes written at each stage, among others. Take a look at SparkMeasure as well as an implementation example if you need to roll your own.
A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Project mention: Show dataengineering: beavis, a library for unit testing Pandas/Dask code | reddit.com/r/dataengineering | 2021-08-09
I am the author of spark-fast-tests and chispa, libraries for unit testing Scala Spark / PySpark code.
A Spark plugin for reading Excel files via Apache POI
Project mention: How do I learn to read a plug-in? | reddit.com/r/apachespark | 2021-08-27
The plug-in in question is crealytics/spark-excel (a Spark plugin for reading Excel files via Apache POI), but I guess it could be any. Assuming that I can read the plain code in an individual .scala file, how do I learn to understand how it all pieces together and what underlying code is being run?
A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
Project mention: The New & Improved Spark UI & Spark History Server is now Generally Available | dev.to | 2021-05-07
We encourage you to try it out! Sign up, follow the installation instructions on our github page, and let us know your feedback over email (by replying to the welcome email) or using the live chat window in the product.
A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.
Project mention: A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm. | reddit.com/r/learnmachinelearning | 2021-10-26
An encrypted data analytics platform
Project mention: How to Run Spark SQL on Encrypted Data | dev.to | 2021-08-10
Introducing Opaque SQL, an open-source platform for securely running Spark SQL queries on encrypted data. Built by top systems and security researchers at UC Berkeley, the platform uses hardware enclaves to securely execute queries on private data in an untrusted environment.
Boilerplate framework to use Spark and ZIO together.
Project mention: Recommendations for specializing in Spark (Scala) | reddit.com/r/scala | 2020-12-22
Snowflake Data Source for Apache Spark.
Project mention: Why Databricks Is Winning | news.ycombinator.com | 2021-02-14
Snowflake and Databricks are different, sometimes complementary technologies. You can store data in Snowflake & query it with Databricks for example: https://github.com/snowflakedb/spark-snowflake
Snowflake predicate pushdown filtering seems quite promising: https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...
Think both these companies can win.
C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.
Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.