Top 23 Scala Spark Projects
Apache Spark - A unified analytics engine for large-scale data processing
Project mention: What is the separation of storage and compute in data platforms and why does it matter? | dev.to | 2022-11-29
However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, or scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines make use of a coordinator to plan the query and multiple worker nodes to execute them in parallel.
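The coordinator/worker split described above can be sketched roughly as follows. This is only an illustration, not how Trino or Spark are implemented: the function names `plan`, `execute_partition` and `distributed_average` are hypothetical, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(rows, num_workers):
    """Coordinator step: split the input into one partition per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def execute_partition(partition):
    """Worker step: compute a partial aggregate over its own partition."""
    return sum(partition), len(partition)


def distributed_average(rows, num_workers=4):
    """Plan the query, run the partitions in parallel, merge the partials."""
    partitions = plan(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(execute_partition, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


average = distributed_average(list(range(1, 101)))
print(average)
```

The key point is that each worker only returns a small partial result (a sum and a count), so the coordinator can merge them cheaply.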
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
Project mention: The Evolution of the Data Engineer Role | news.ycombinator.com | 2022-10-24
FACT table (sale $, order quantity, order ID, product ID)
customer, account_types etc. are dimensions to filter your low-level transactional data. The schema looks like a snowflake when you add enough dimensions, hence the name.
The FACT table makes "measures" available to the user. Example: Count of Orders. These are based on the values in the FACT table (your big table of IDs that link to dimensions and low-level transactional data).
You can then slice and dice your count of orders by fields in the dimensions.
You could then add Sum of Sale ($) as an additional measure. "Abstract" measures like Average Sale ($) per Order can also be added in the OLAP backend engine.
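As a small sketch of the fact/dimension pattern, here are the three measures computed with the standard-library sqlite3 module. The table names `fact_sales` and `dim_customer`, the `customer_id` link column and the sample rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, account_type TEXT);
CREATE TABLE fact_sales (order_id INTEGER, product_id INTEGER, customer_id INTEGER,
                         order_quantity INTEGER, sale_amount REAL);
INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'business');
INSERT INTO fact_sales VALUES
    (100, 7, 1, 2, 20.0),
    (101, 8, 1, 1, 10.0),
    (102, 7, 2, 5, 60.0);
""")

# Measures come from the fact table; the dimension is what you slice them by.
measures = conn.execute("""
SELECT d.account_type,
       COUNT(DISTINCT f.order_id)                      AS count_of_orders,
       SUM(f.sale_amount)                              AS sum_of_sale,
       SUM(f.sale_amount) / COUNT(DISTINCT f.order_id) AS avg_sale_per_order
FROM fact_sales f
JOIN dim_customer d ON d.customer_id = f.customer_id
GROUP BY d.account_type
ORDER BY d.account_type
""").fetchall()

for row in measures:
    print(row)
```

Adding another dimension (say, a product table) would only mean one more join and one more column in the GROUP BY; the measures themselves stay the same.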
End users will often be using Excel or Tableau to create their own dashboards / graphs / reports. This pattern makes sense in that case --> user can explore the heavily structured business data according to all the pre-existing business rules.
Pros:
- Great for enterprise businesses with existing application databases
- Highly structured and transaction support (ACID compliance)
- Ease of use for end business user (create a new pivot table in Excel)
- Easy to query (basically a bunch of SQL queries)
- Encapsulates all your business rules in one place -- a.k.a. single source of truth.
Cons:
- Massive start up cost (have to work out the schema before you even write any code)
- Slow to change (imagine if the raw transaction amounts suddenly changed to £ after a certain date!)
- Massive nightly ETL jobs (these break fairly often)
- Usually proprietary tooling / storage (think MS SQL Server)
2. Data Lake
Throw everything into an S3 bucket. Database table? Throw it into the S3 bucket. Image data? Throw it into the S3 bucket. Kitchen sink? Throw it into the S3 bucket.
Process your data when you're ready to process it. Read in your data from S3, process it, write back to S3 as an "output file" for downstream consumption.
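A minimal sketch of that read, process, write-back loop. To keep it runnable without cloud credentials, a temporary local directory stands in for the S3 bucket; the key names `raw/orders.csv` and `output/order_totals.csv` and the sample data are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Local directory standing in for an S3 bucket (assumption for this sketch).
bucket = Path(tempfile.mkdtemp())
raw_key = bucket / "raw" / "orders.csv"
out_key = bucket / "output" / "order_totals.csv"

# "Throw it into the bucket": land the raw file as-is.
raw_key.parent.mkdir(parents=True)
raw_key.write_text("order_id,sale\n1,10.5\n2,4.5\n1,5.0\n")

# Later, when you are ready: read it in and process it.
totals = {}
with raw_key.open(newline="") as f:
    for row in csv.DictReader(f):
        totals[row["order_id"]] = totals.get(row["order_id"], 0.0) + float(row["sale"])

# Write the result back as an output file for downstream consumers.
out_key.parent.mkdir(parents=True)
with out_key.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "total_sale"])
    for order_id, total in sorted(totals.items()):
        writer.writerow([order_id, total])
```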
Pros:
- Easy to set up
- Fairly simple and standardised I/O (S3 APIs work with pandas and PySpark dataframes etc.)
- Can store data remotely until ready to process it
- Highly flexible as mostly unstructured (create new S3 keys -- a.k.a. directories -- on the fly)
- Cheap storage
Cons:
- Doesn't scale -- turns into a "data swamp"
- Not always ACID compliant (looking at you Digital Ocean)
- Very easy to duplicate data
3. Data Lakehouse
Essentially a data lake with some additional bits.
A. Delta Lake Storage Format a.k.a. Delta Tables
Versioned files acting like versioned tables. Writing to a file will create a new version of the file, with previous versions stored for a set number of updates. Appending to the file creates a new version of the file in the same way (e.g. add a new order streamed in from the ordering application).
Every file -- a.k.a. delta table -- becomes ACID compliant. You can roll back the table to last week's state and replay from there, e.g. because change X caused bug Y.
AWS does allow you to do this, but it was a right ol' pain in the arse whenever I had to deal with massively partitioned parquet files. Delta Lake makes versioning the outputs much easier and it is much easier to roll back.
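The versioning idea can be illustrated with a toy in-memory table. This is not Delta Lake's API or storage format; `VersionedTable` and its methods are hypothetical names, and the sketch only shows the "every write is a new version, old versions stay readable" behaviour.

```python
class VersionedTable:
    """Toy append-only table: every write produces a new immutable version."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._versions) - 1

    def append(self, rows):
        # Build a new snapshot so earlier versions are never mutated.
        self._versions.append(self._versions[-1] + list(rows))
        return self.version

    def read(self, version=None):
        snapshot = self._versions[-1 if version is None else version]
        return list(snapshot)

    def rollback(self, version):
        # In this sketch a rollback is itself a new version, so history is kept.
        self._versions.append(list(self._versions[version]))
        return self.version


orders = VersionedTable()
orders.append([{"order_id": 1, "sale": 10.0}])   # version 1
orders.append([{"order_id": 2, "sale": -99.0}])  # version 2, a bad write
orders.rollback(1)                               # version 3 restores version 1
print(orders.version, orders.read())
```

Because the bad write is still stored as version 2, you can inspect it after the rollback to work out what went wrong.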
B. Data Storage Layout
Enforce a schema based on processing stages to get some performance & data governance benefits.
Example processing stage schema: DATA IN -> EXTRACT -> TRANSFORM -> AGGREGATE -> REPORTABLE
Or the "medallion" schema: Bronze -> Silver -> Gold.
Write out the data at each processing stage to a delta lake table/file. You can now query 5x data sources instead of 2x. The table's rarity indicates the degree of "data enrichment" you have performed -- i.e. how useful have you made the data. Want to update the codebase for the AGGREGATE stage? Just rerun from the TRANSFORM table (rather than run it all from scratch). This also acts as a caching layer. In a Data Warehouse, the entire query needs to be run from scratch each time you change a field. Here, you could just deliver the REPORTABLE tables as artefacts whenever you change them.
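A rough sketch of the "rerun from the TRANSFORM table" idea, with a plain dictionary standing in for the per-stage tables. The stage functions, `stage_tables`, `run_log` and `run_pipeline` are hypothetical names for illustration only.

```python
stage_tables = {}  # stand-in for one stored table per processing stage
run_log = []       # records which stages actually executed


def extract(raw):
    return [line.split(",") for line in raw]


def transform(rows):
    return [{"product": p, "sale": float(s)} for p, s in rows]


def aggregate(records):
    totals = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0.0) + r["sale"]
    return totals


STAGES = [("EXTRACT", extract), ("TRANSFORM", transform), ("AGGREGATE", aggregate)]


def run_pipeline(raw, start_from="EXTRACT"):
    """Run from `start_from`, reading the previous stage's stored output."""
    names = [name for name, _ in STAGES]
    start = names.index(start_from)
    data = raw if start == 0 else stage_tables[names[start - 1]]
    for name, fn in STAGES[start:]:
        data = fn(data)
        stage_tables[name] = data  # write out the data at each stage
        run_log.append(name)
    return data


run_pipeline(["apple,1.5", "pear,2.0", "apple,0.5"])  # full run
result = run_pipeline(None, start_from="AGGREGATE")   # rerun only the last stage
print(result, run_log)
```

The second call never touches the raw input: it picks up the stored TRANSFORM output, which is the caching effect described above.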
C. "Metadata" Tracking
See AWS Glue Data Catalog.
Index files that match a specific S3 key pattern and/or file format and/or AWS S3 tag etc. throughout your S3 bucket. Store the results in a publicly accessible table. Now you can perform SQL queries against the metadata of your data. Want to find that file you were working on last week? Run a query based on last modified time. Want to find files that contain a specific column name? Run a query based on column names.
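To make the metadata idea concrete, here is a small sketch using an in-memory SQLite database as the catalog. It is not the AWS Glue Data Catalog API; the `files` and `file_columns` tables, the paths and the dates are invented for the example.

```python
import sqlite3

catalog = sqlite3.connect(":memory:")
catalog.executescript("""
CREATE TABLE files (path TEXT PRIMARY KEY, format TEXT, last_modified TEXT);
CREATE TABLE file_columns (path TEXT, column_name TEXT);
INSERT INTO files VALUES
    ('gold/orders.parquet',  'parquet', '2022-11-28'),
    ('bronze/customers.csv', 'csv',     '2022-10-01');
INSERT INTO file_columns VALUES
    ('gold/orders.parquet',  'order_id'),
    ('gold/orders.parquet',  'sale'),
    ('bronze/customers.csv', 'customer_id');
""")

# "Want to find that file you were working on last week?"
recent = [p for (p,) in catalog.execute(
    "SELECT path FROM files WHERE last_modified >= ? ORDER BY path",
    ("2022-11-21",))]

# "Want to find files that contain a specific column name?"
with_sale = [p for (p,) in catalog.execute(
    "SELECT DISTINCT path FROM file_columns WHERE column_name = ?",
    ("sale",))]

print(recent, with_sale)
```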
Pros:
- transactional versioning -- ACID compliance and the ability to roll back data over time (I accidentally deleted an entire column of data / calculated VAT wrong yesterday)
- processing-stage schema storage layout acts as a caching layer (only process from the stage where you need to)
- no need for humans to remember the specific path to the files they were working on as files are all indexed
- less chance of creating a "data swamp"
- changes become easier to audit as you can track the changes between versions
Cons:
- Delta Lake table format is only really available with Apache Spark / Databricks processing engines (mostly, for now)
- Requires enforcement of the processing-stage schema (your data scientists will just ignore you when you request they start using it)
- More setup cost than a simple data lake
- Basically a move back towards proprietary tooling (some FOSS libs are starting to pop up for it)
4. Data Mesh
geoduck14's answer on this was pretty good. Basically have a data infrastructure team, and then domain-specific teams that spring up as needed (like an infra team looking after your k8s clusters, and application teams that use the clusters). Domain-specific data teams use the data platform provided by the data infrastructure team.
Previously worked somewhere in a "product" team which basically performed this function. They just didn't call it a "data mesh".
Simple and Distributed Machine Learning
Project mention: Data science in Scala | reddit.com/r/scala | 2022-11-05
b) There are libraries around e.g. Microsoft SynapseML, LinkedIn Photon ML
State of the Art Natural Language Processing
Project mention: Data science in Scala | reddit.com/r/scala | 2022-11-05
I am not aware of common open frameworks like Tensorflow, PyTorch or Scikit-learn for Scala. But specifically for natural language processing, there's SparkNLP from John Snow Labs.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Project mention: deequ VS cuallee - a user suggested alternative | libhunt.com/r/deequ | 2022-11-30
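The "unit tests for data" idea can be sketched in a few lines of plain Python. This does not use Deequ or its API; the helpers `completeness` and `is_unique` and the sample rows are hypothetical, and a real library would compute such metrics at scale on Spark.

```python
rows = [
    {"order_id": 1, "sale": 10.0},
    {"order_id": 2, "sale": None},
    {"order_id": 2, "sale": 5.0},
]


def completeness(rows, column):
    """Fraction of rows where the column is not null."""
    return sum(r[column] is not None for r in rows) / len(rows)


def is_unique(rows, column):
    """True when no value in the column repeats."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


# Each check is a named assertion about the dataset, like a unit test.
checks = {
    "sale is complete": completeness(rows, "sale") == 1.0,
    "order_id is unique": is_unique(rows, "order_id"),
    "table is not empty": len(rows) > 0,
}
failed = [name for name, passed in checks.items() if not passed]
print(failed)
```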
Compile-time Language Integrated Queries for Scala
Project mention: What's the point of opaque type aliases (and are they actually sound)? | reddit.com/r/scala | 2022-11-26
Just as an example, say you are using quill ( https://getquill.io/ ) to query your database.
DataStax Spark Cassandra Connector (by datastax)
Project mention: Reading from cassandra in Spark does not return all the data when using JoinWithCassandraTable | reddit.com/r/apachespark | 2022-03-09
This works perfectly fine and I get all the data I'm expecting. However, if I change spark.cassandra.sql.inClauseToJoinConversionThreshold (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md) to something lower like 20, I hit the threshold (my cross-product is 10*10=100) and JoinWithCassandraTable will be used. I suddenly do not get all the data, and on top of that I get duplicated rows for some of the data. It looks like I'm completely missing some of the partition keys, and some of the partition keys return duplicated rows (this quick analysis might however be wrong).
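The arithmetic behind "hitting the threshold" can be written out as a tiny check. This is a simplification based on the behaviour the post describes (the query switches to a join once the cross-product of the IN value sets exceeds the threshold), not connector code; `uses_join` is a hypothetical helper and 2500 is just an illustrative higher setting.

```python
from math import prod


def uses_join(in_value_sets, threshold):
    """True when the cross-product of all IN value sets exceeds the threshold."""
    cross_product = prod(len(values) for values in in_value_sets)
    return cross_product > threshold


# Two IN clauses with 10 values each, as in the post: 10 * 10 = 100.
in_clauses = [list(range(10)), list(range(10))]

print(uses_join(in_clauses, threshold=20))    # 100 > 20, so the join path is taken
print(uses_join(in_clauses, threshold=2500))  # 100 <= 2500, so the IN clauses are kept
```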
A Scala kernel for Jupyter
Project mention: 💐 Making VSCode itself a Java REPL 🔁 | reddit.com/r/java | 2022-09-05
MLeap: Deploy ML Pipelines to Production
Project mention: Machine Learning Pipelines with Spark: Introductory Guide (Part 1) | dev.to | 2022-10-23
Everything is custom and will take a lot of work, but luckily, you don’t have to do all the work here. In the second article, you will use MLeap, a library that does the heavy lifting in terms of serializing Spark ML Pipelines for real-time inference and also provides an execution engine for Spark so you can deploy pipelines on non-Spark runtimes.
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Project mention: Advanced Scientific Data Format | news.ycombinator.com | 2022-09-30
We presented using Parquet formats for bioinformatics 2012/13-ish at the Bioinformatics Open Source Conference (BOSC) and got laughed out of the place.
While using Apache Spark for bioinformatics never really took off, I still think Parquet formats for bioinformatics is a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.
Sparkling Water provides H2O functionality inside Spark cluster
TiSpark is built for running Apache Spark on top of TiDB/TiKV
Project mention: A simple way to import TiSpark into Databricks to load TiDB data | dev.to | 2022-09-16
    dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
    dbutils.fs.put(
        "/databricks/scripts/tispark-install.sh",
        """
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/tispark-assembly-3.2_2.12-3.1.0-SNAPSHOT.jar https://github.com/pingcap/tispark/releases/download/v3.1.0/tispark-assembly-3.2_2.12-3.1.0.jar
    """,
        True)
Expressive types for Spark.
Project mention: Why use Spark at all? | reddit.com/r/dataengineering | 2022-10-19
To add to this, I have lately used Spark with frameless for compile-time safety, and it's an interesting library that works well with Spark.
Essential Spark extensions and helper methods ✨😲
A simplified, lightweight ETL Framework based on Apache Spark
This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
Spark RAPIDS plugin - accelerate Apache Spark with GPUs
Data Lineage Tracking And Visualization Solution (by AbsaOSS)
Project mention: Show HN: First open source data discovery and observability platform | news.ycombinator.com | 2022-10-22
We found a way by leveraging the Spline Agent (https://github.com/AbsaOSS/spline) to make use of the Execution Plans, transform them into a suitable data model for our set of requirements and developed a UI to explore these relationships. We also open-sourced our approach in a
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Project mention: Alternatives to update by query | reddit.com/r/Solr | 2022-04-06
You could use Spark
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Project mention: Well designed scala/spark project | reddit.com/r/scala | 2022-10-15
A Scala wrapper for Deeplearning4j, inspired by Keras. Scala + DL + Spark + GPUs
A Spark plugin for reading and writing Excel files
Project mention: Automating Excel to Databricks Table | reddit.com/r/dataengineering | 2022-09-18
Not natively. But the com.crealytics.spark.excel library has had great results for us. There are still some cases where pandas manipulation is needed with Excel files that have weird header setups.
Scala Spark related posts
deequ VS cuallee - a user suggested alternative
2 projects | 30 Nov 2022
Data science in Scala
5 projects | reddit.com/r/scala | 5 Nov 2022
The Evolution of the Data Engineer Role
2 projects | news.ycombinator.com | 24 Oct 2022
Why use Spark at all?
2 projects | reddit.com/r/dataengineering | 19 Oct 2022
Well designed scala/spark project
4 projects | reddit.com/r/scala | 15 Oct 2022
Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art models & pipelines, and many more!
1 project | reddit.com/r/apachespark | 28 Sep 2022
What are some of the best open-source Spark projects in Scala? This list will help you: