Top 23 apache-spark Open-Source Projects

MLflow

56 17,335 9.9 Python

Open source platform for the machine learning lifecycle

Project mention: Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations | dev.to | 2024-04-26

How can this be? The current state of practice in AI/ML work requires adaptivity, which is uncommon in classical computational fields. There are myriad tools that capture the work across the many instances of the AI/ML lifecycle. The idea that any one tool could sufficiently capture the dynamic work is unrealistic. Take, for example, an experiment tracking tool like W&B or MLFlow; some form of experiment tracking is necessary in typical model training lifecycles. Such a tool requires some notion of a dataset. However, a tool focusing on experiment tracking is orthogonal to the needs of analyzing model performance at the data sample level, which is critical to understanding the failure modes of models. The way one does this depends on the type of data and the AI/ML task at hand. In other words, MLOps is inherently an intricate mosaic, as the capabilities and best practices of AI/ML work evolve.

SynapseML

18 4,970 9.0 Scala

Simple and Distributed Machine Learning

Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12

lakeFS

48 4,081 9.8 Go

lakeFS - Data version control for your data lake | Git for data

Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

# Download the LakeFS binary
wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
# Make the binary executable
chmod +x lakefs
# Initialize LakeFS with S3 as the storage backend
./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

Spark Notebook

0 3,147 0.0 JavaScript

Interactive and Reactive Data Science using Scala and Spark.
spark-operator

8 2,613 8.2 Go

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Project mention: Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator | /r/codehunter | 2023-05-31

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.

docker-spark

1 2,011 0.0 Shell

Apache Spark docker image
spark

3 1,999 0.0 C#

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers. (by dotnet)
feathr

9 1,931 5.7 Scala

Feathr – A scalable, unified data and AI engineering platform for enterprise
awesome-spark

1 1,617 1.0 Shell

A curated list of awesome Apache Spark packages and resources.
LearningSparkV2

1 1,095 0.0 Scala

This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Mobius: C# API for Spark

0 937 4.2 C#

C# and F# language binding and extensions to Apache Spark (by microsoft)
sparkMeasure

1 642 7.5 Scala

This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
flintrock

1 630 4.7 Python

A command-line tool for launching Apache Spark clusters.
quinn

9 580 9.1 Python

pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)
awesome-kafka

1 565 4.7

A list about Apache Kafka
sparkle

0 444 0.0 Haskell

Haskell on Apache Spark. (by tweag)
PySpark-Boilerplate

1 391 2.5 Python

A boilerplate for writing PySpark Jobs
sparktorch

1 335 2.5 Python

Train and run Pytorch models on Apache Spark.
delight

2 332 1.2 Scala

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.
cuelake

2 284 0.0 JavaScript

Use SQL to build ELT pipelines on a data lakehouse.
scalable-data-science

1 165 3.0 HTML

Scalable Data Science, course sets in big data Using Apache Spark over databricks and their mathematical, statistical and computational foundations using SageMath.
spark

1 126 9.5 TypeScript

Performance Observability for Apache Spark (by dataflint)

Project mention: Show HN: DataFlint, performance monitoring for Apache Spark | news.ycombinator.com | 2023-12-28

dataproc-templates

1 111 8.7 Python

Dataproc templates and pipelines for solving simple in-cloud data tasks

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

apache-spark related posts

Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations

1 project | dev.to | 26 Apr 2024

Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly

1 project | /r/dataengineering | 17 Jun 2023

[D] What licensed software do you use for machine learning experimentation tracking?

1 project | /r/MachineLearning | 11 Jun 2023

Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

1 project | /r/codehunter | 31 May 2023

[Q] Is there a tool to keep track of my ML experiments?

1 project | /r/datascience | 13 May 2023

Experience setting up Spark and Hudi on Kubernetes

2 projects | /r/dataengineering | 15 Apr 2023

Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs

1 project | /r/LanguageTechnology | 24 Mar 2023

Index

What are some of the best open-source apache-spark projects? This list will help you:

#	Project	Stars
1	MLflow	17,335
2	SynapseML	4,970
3	lakeFS	4,081
4	Spark Notebook	3,147
5	spark-operator	2,613
6	docker-spark	2,011
7	spark	1,999
8	feathr	1,931
9	awesome-spark	1,617
10	LearningSparkV2	1,095
11	Mobius: C# API for Spark	937
12	sparkMeasure	642
13	flintrock	630
14	quinn	580
15	awesome-kafka	565
16	sparkle	444
17	PySpark-Boilerplate	391
18	sparktorch	335
19	delight	332
20	cuelake	284
21	scalable-data-science	165
22	spark	126
23	dataproc-templates	111
