Scala Big Data

Open-source Scala projects categorized as Big Data

Top 23 Scala Big Data Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Deequ for generating data quality reports | dev.to | 2022-11-24

    aws documentation — Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.

  • kafka-manager

    CMAK is a tool for managing Apache Kafka clusters

    Project mention: Running multi-broker Kafka using docker | reddit.com/r/apachekafka | 2022-09-27

    Dockerized kafka manager (Yahoo CMAK)

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Project mention: The Evolution of the Data Engineer Role | news.ycombinator.com | 2022-10-24

    FACT table (sale $, order quantity, order ID, product ID)

    customer, account_types etc. are dimensions used to filter your low-level transactional data. The schema looks like a snowflake when you add enough dimensions, hence the name.

    The FACT table makes "measures" available to the user. Example: Count of Orders. These are based on the values in the FACT table (your big table of IDs that link to dimensions and low-level transactional data).

    You can then slice and dice your count of orders by fields in the dimensions.

    You could then add Sum of Sale ($) as an additional measure. "Abstract" measures like Average Sale ($) per Order can also be added in the OLAP backend engine.
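
    The measures described above can be sketched in plain Python. This is only an illustration under assumptions: the `products` dimension, the `facts` rows, and the `measures_by` helper are hypothetical and not part of the original comment or of any real warehouse engine.

```python
from collections import defaultdict

# Hypothetical dimension table: product ID -> product attributes.
products = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "gadget", "category": "hardware"},
    3: {"name": "ebook", "category": "digital"},
}

# Hypothetical FACT table rows: (order ID, product ID, order quantity, sale $).
facts = [
    (100, 1, 2, 20.0),
    (101, 2, 1, 15.0),
    (102, 3, 1, 5.0),
    (103, 1, 1, 10.0),
]


def measures_by(dimension_field):
    """Slice the measures by one field of the product dimension."""
    count = defaultdict(int)
    total = defaultdict(float)
    for order_id, product_id, quantity, sale in facts:
        key = products[product_id][dimension_field]
        count[key] += 1      # measure: Count of Orders
        total[key] += sale   # measure: Sum of Sale ($)
    return {
        key: {
            "count_of_orders": count[key],
            "sum_of_sale": total[key],
            # "Abstract" measure derived from the two base measures.
            "avg_sale_per_order": total[key] / count[key],
        }
        for key in count
    }


by_category = measures_by("category")
```

    Calling `measures_by("name")` instead slices the same measures by a different dimension field, which is the "slice and dice" step the comment describes.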

    End users will often be using Excel or Tableau to create their own dashboards / graphs / reports. This pattern makes sense in that case --> user can explore the heavily structured business data according to all the pre-existing business rules.

    Pros:

    - Great for enterprise businesses with existing application databases

    - Highly structured and transaction support (ACID compliance)

    - Ease of use for end business user (create a new pivot table in Excel)

    - Easy to query (basically a bunch of SQL queries)

    - Encapsulates all your business rules in one place -- a.k.a. single source of truth.

    Cons:

    - Massive start up cost (have to work out the schema before you even write any code)

    - Slow to change (imagine if the raw transaction amounts suddenly changed to £ after a certain date!)

    - Massive nightly ETL jobs (these break fairly often)

    - Usually proprietary tooling / storage (think MS SQL Server)

    ---

    2. Data Lake

    Throw everything into an S3 bucket. Database table? Throw it into the S3 bucket. Image data? Throw it into the S3 bucket. Kitchen sink? Throw it into the S3 bucket.

    Process your data when you're ready to process it. Read in your data from S3, process it, write back to S3 as an "output file" for downstream consumption.
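
    A minimal sketch of that read, process, write-back loop follows. To stay runnable without cloud credentials it assumes a local temporary directory standing in for the S3 bucket; the key names (`raw/orders.csv`, `output/order_totals.csv`) are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Assumption: a local directory stands in for the S3 bucket.
bucket = Path(tempfile.mkdtemp())

# "Throw it into the bucket": land the raw file untouched.
raw_key = bucket / "raw" / "orders.csv"
raw_key.parent.mkdir(parents=True)
raw_key.write_text("order_id,sale\n1,20.0\n2,15.0\n")

# Later, when you are ready to process it: read the raw data back in.
with raw_key.open(newline="") as f:
    rows = list(csv.DictReader(f))

# Process it.
total = sum(float(row["sale"]) for row in rows)

# Write the result back as an "output file" for downstream consumption.
out_key = bucket / "output" / "order_totals.csv"
out_key.parent.mkdir(parents=True)
out_key.write_text(f"total_sale\n{total}\n")
```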

    Pros:

    - Easy to set up

    - Fairly simple and standardised i/o (S3 apis work with pandas and pyspark dataframes etc)

    - Can store data remotely until ready to process it

    - Highly flexible as mostly unstructured (create new S3 keys -- a.k.a. directories -- on the fly)

    - Cheap storage

    Cons:

    - Doesn't scale -- turns into a "data swamp"

    - Not always ACID compliant (looking at you Digital Ocean)

    - Very easy to duplicate data

    ---

    3. Data Lakehouse

    Essentially a data lake with some additional bits.

    A. Delta Lake Storage Format a.k.a. Delta Tables

    https://delta.io

    Versioned files acting like versioned tables. Writing to a file will create a new version of the file, with previous versions stored for a set number of updates. Appending to the file creates a new version of the file in the same way (e.g. add a new order streamed in from the ordering application).

    Every file -- a.k.a. delta table -- becomes ACID compliant. You can rollback the table to last week and replay e.g. because change X caused bug Y to happen.

    AWS does allow you to do this, but it was a right ol' pain in the arse whenever I had to deal with massively partitioned parquet files. Delta Lake makes versioning the outputs much easier and it is much easier to rollback.
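
    The versioning idea can be illustrated with a toy in-memory table. This is not the Delta Lake API: the `VersionedTable` class and its methods are hypothetical, and the sketch assumes that a rollback is recorded as a new version.

```python
import copy


class VersionedTable:
    """Toy table that keeps one snapshot per write (illustration only)."""

    def __init__(self, max_versions=10):
        self.max_versions = max_versions  # how many snapshots to retain
        self._versions = [[]]             # version 0 is the empty table
        self._first_version = 0           # version number of _versions[0]

    @property
    def latest_version(self):
        return self._first_version + len(self._versions) - 1

    def read(self, version=None):
        """Return a copy of the rows at the given (default: latest) version."""
        if version is None:
            version = self.latest_version
        index = version - self._first_version
        if not 0 <= index < len(self._versions):
            raise KeyError(f"version {version} is not retained")
        return copy.deepcopy(self._versions[index])

    def append(self, rows):
        """Appending creates a new version; older versions stay readable."""
        self._commit(self.read() + copy.deepcopy(list(rows)))

    def rollback(self, version):
        """Restore an older snapshot by committing it as the newest version."""
        self._commit(self.read(version))

    def _commit(self, snapshot):
        self._versions.append(snapshot)
        # Drop the oldest snapshots once the retention limit is exceeded.
        while len(self._versions) > self.max_versions:
            self._versions.pop(0)
            self._first_version += 1


orders = VersionedTable()
orders.append([{"order_id": 1, "sale": 20.0}])    # version 1
orders.append([{"order_id": 2, "sale": -999.0}])  # version 2: a bad row
orders.rollback(1)                                # version 3 == version 1
```

    After the rollback, `orders.read()` returns only the first order, while `orders.read(2)` still exposes the bad version for auditing.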

    B. Data Storage Layout

    Enforce a schema based on processing stages to get some performance & data governance benefits.

    Example processing stage schema: DATA IN -> EXTRACT -> TRANSFORM -> AGGREGATE -> REPORTABLE

    Or the "medallion" schema: Bronze -> Silver -> Gold.

    Write out the data at each processing stage to a delta lake table/file. You can now query 5x data sources instead of 2x. The table's rarity indicates the degree of "data enrichment" you have performed -- i.e. how useful have you made the data. Want to update the codebase for the AGGREGATE stage? Just rerun from the TRANSFORM table (rather than run it all from scratch). This also acts as a caching layer. In a Data Warehouse, the entire query needs to be run from scratch each time you change a field. Here, you could just deliver the REPORTABLE tables as artefacts whenever you change them.
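
    The caching effect of writing out every stage can be sketched as follows. A plain dict stands in for the per-stage delta tables, and the stage functions, `run` helper, and sample rows are all hypothetical.

```python
# Assumption: a dict stands in for the table written out at each stage.
stage_tables = {}
run_log = []  # records which stages actually executed


def extract(raw_lines):
    run_log.append("extract")
    return [line.split(",") for line in raw_lines]


def transform(rows):
    run_log.append("transform")
    return [{"product": product, "sale": float(sale)} for product, sale in rows]


def aggregate(rows):
    run_log.append("aggregate")
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0.0) + row["sale"]
    return totals


STAGES = [("EXTRACT", extract), ("TRANSFORM", transform), ("AGGREGATE", aggregate)]


def run(from_stage="EXTRACT"):
    """Run the pipeline from a stage, reading the previous stage's table."""
    names = [name for name, _ in STAGES]
    start = names.index(from_stage)
    previous = "DATA IN" if start == 0 else names[start - 1]
    data = stage_tables[previous]
    for name, func in STAGES[start:]:
        data = func(data)
        stage_tables[name] = data  # write out the table for this stage
    return data


stage_tables["DATA IN"] = ["widget,20.0", "gadget,15.0", "widget,10.0"]
full = run()                         # runs every stage once
run_log.clear()
rerun = run(from_stage="AGGREGATE")  # reuses the stored TRANSFORM table
```

    On the second call only `aggregate` executes, because its input is read from the stored TRANSFORM table rather than recomputed.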

    C. "Metadata" Tracking

    See AWS Glue Data Catalog.

    Index files that match a specific S3 key pattern and/or file format and/or AWS S3 tag etc. throughout your S3 bucket. Store the results in a publicly accessible table. Now you can perform SQL queries against the metadata of your data. Want to find that file you were working on last week? Run a query based on last modified time. Want to find files that contain a specific column name? Run a query based on column names.
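
    As a rough sketch of querying metadata with SQL, the example below uses an in-memory SQLite database from Python's standard library. It is not the AWS Glue Data Catalog API; the table layout, file keys, and dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical catalog schema: one row per file, one row per column in a file.
conn.execute(
    "CREATE TABLE files (key TEXT PRIMARY KEY, format TEXT, last_modified TEXT)"
)
conn.execute("CREATE TABLE file_columns (key TEXT, column_name TEXT)")

# Hypothetical results of indexing a bucket.
catalog = [
    ("sales/2022/orders.parquet", "parquet", "2022-10-20", ["order_id", "sale"]),
    ("sales/2022/customers.parquet", "parquet", "2022-09-01", ["customer_id", "name"]),
    ("images/logo.png", "png", "2022-10-22", []),
]
for key, fmt, modified, columns in catalog:
    conn.execute("INSERT INTO files VALUES (?, ?, ?)", (key, fmt, modified))
    conn.executemany(
        "INSERT INTO file_columns VALUES (?, ?)", [(key, c) for c in columns]
    )

# "Want to find that file you were working on last week?"
recent = [
    row[0]
    for row in conn.execute(
        "SELECT key FROM files WHERE last_modified >= ? ORDER BY last_modified",
        ("2022-10-17",),
    )
]

# "Want to find files that contain a specific column name?"
with_sale = [
    row[0]
    for row in conn.execute(
        "SELECT key FROM file_columns WHERE column_name = ?", ("sale",)
    )
]
```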

    Pros:

    - transactional versioning -- ACID compliance and the ability to rollback data over time (I accidentally deleted an entire column of data / calculated VAT wrong yesterday)

    - processing-stage schema storage layout acts as a caching layer (only process from the stage where you need to)

    - no need for humans to remember the specific path to the files they were working on as files are all indexed

    - less chance of creating a "data swamp"

    - changes become easier to audit as you can track the changes between versions

    Cons:

    - Delta lake table format is only really available with Apache Spark / Databricks processing engines (mostly, for now)

    - Requires enforcement of the processing-stage schema (your data scientists will just ignore you when you request they start using it)

    - More setup cost than a simple data lake

    - Basically a move back towards proprietary tooling (some FOSS libs are starting to pop up for it)

    ---

    4. Data Mesh

    geoduck14's answer on this was pretty good. Basically, have a data infrastructure team, and then domain-specific teams that spring up as needed (like an infra team looking after your k8s clusters, and application teams that use the clusters). The domain-specific data teams use the data platform provided by the data infrastructure team.

    Previously worked somewhere in a "product" team which basically performed this function. They just didn't call it a "data mesh".

  • SynapseML

    Simple and Distributed Machine Learning

    Project mention: Data science in Scala | reddit.com/r/scala | 2022-11-05

    b) There are libraries around e.g. Microsoft SynapseML, LinkedIn Photon ML

  • Scalding

    A Scala API for Cascading

  • Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

    Project mention: For the DE's that choose Java over Python in new projects, why? | reddit.com/r/dataengineering | 2022-06-02

    I doubt it is possible because I suspect that GIL would like a word. So I could spend nights trying to make it work in Python (and possibly, if not likely, fail). Or I could just use this ready made solution.

  • Jupyter Scala

    A Scala kernel for Jupyter

    Project mention: 💐 Making VSCode itself a Java REPL 🔁 | reddit.com/r/java | 2022-09-05

    Check out almond

  • Reactive-kafka

    Alpakka Kafka connector - Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.

  • adam

    ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

    Project mention: Advanced Scientific Data Format | news.ycombinator.com | 2022-09-30

    We presented using Parquet formats for bioinformatics 2012/13-ish at the Bioinformatics Open Source Conference (BOSC) and got laughed out of the place.

    While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] is a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.

    0 - https://github.com/bigdatagenomics/adam

    1 - https://github.com/bigdatagenomics/bdg-formats

  • BIDMach

    CPU and GPU-accelerated Machine Learning Library

  • Gearpump

    Lightweight real-time big data streaming engine over Akka

  • Vegas

    The missing MatPlotLib for Scala + Spark (by vegas-viz)

  • metorikku

    A simplified, lightweight ETL Framework based on Apache Spark

  • Sparkta

    Real Time Analytics and Data Pipelines based on Spark Streaming (by Stratio)

  • Scoobi

    A Scala productivity framework for Hadoop. (by NICTA)

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • nussknacker

    A visual tool to define and run real-time decision algorithms. Brings agility to business teams, liberates developers to focus on technology.

  • Clustering4Ever

    C4E, a JVM friendly library written in Scala for both local and distributed (Spark) Clustering.

  • qbeast-spark

    Qbeast-spark: DataSource enabling multi-dimensional indexing and efficient data sampling. Big Data, free from the unnecessary!

    Project mention: Collaborative roadmap for qbeast-spark: Open Source Table Format | reddit.com/r/apachespark | 2022-06-07

    We want to develop qbeast-spark in an open way, so we publish a tentative Roadmap for this summer https://github.com/Qbeast-io/qbeast-spark/discussions/108

  • Schemer

    Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

  • Scoozie

    Scala DSL on top of Oozie XML

  • spark-deployer

    Deploy Spark cluster in an easy way.

  • Spark Utils

    Basic framework utilities to quickly start writing production ready Apache Spark applications

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-11-24.

Index

What are some of the best open-source Big Data projects in Scala? This list will help you:

#    Project           Stars
1    Apache Spark      34,423
2    kafka-manager     11,120
3    delta             5,433
4    SynapseML         3,843
5    Scalding          3,385
6    Scio              2,397
7    Jupyter Scala     1,463
8    Reactive-kafka    1,391
9    adam              936
10   BIDMach           912
11   Gearpump          763
12   Vegas             724
13   metorikku         526
14   Sparkta           523
15   Scoobi            484
16   spark-rapids      478
17   nussknacker       340
18   Clustering4Ever   124
19   qbeast-spark      117
20   Schemer           110
21   Scoozie           81
22   spark-deployer    76
23   Spark Utils       30