apache-spark

Open-source projects categorized as apache-spark

Top 23 apache-spark Open-Source Projects

  • MLflow

    Open source platform for the machine learning lifecycle

  • Project mention: Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations | dev.to | 2024-04-26

    How can this be? The current state of practice in AI/ML work requires adaptivity, which is uncommon in classical computational fields. There are myriad tools that capture the work across the many instances of the AI/ML lifecycle. The idea that any one tool could sufficiently capture the dynamic work is unrealistic. Take, for example, an experiment tracking tool like W&B or MLFlow; some form of experiment tracking is necessary in typical model training lifecycles. Such a tool requires some notion of a dataset. However, a tool focusing on experiment tracking is orthogonal to the needs of analyzing model performance at the data sample level, which is critical to understanding the failure modes of models. The way one does this depends on the type of data and the AI/ML task at hand. In other words, MLOps is inherently an intricate mosaic, as the capabilities and best practices of AI/ML work evolve.

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

    # Download the LakeFS binary
    wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
    # Make the binary executable
    chmod +x lakefs
    # Initialize LakeFS with S3 as the storage backend
    ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

  • Spark Notebook

    Interactive and Reactive Data Science using Scala and Spark.

  • spark-operator

    Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

  • Project mention: Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator | /r/codehunter | 2023-05-31

    I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.

  • docker-spark

    Apache Spark docker image

  • spark

    .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers. (by dotnet)

  • feathr

    Feathr – A scalable, unified data and AI engineering platform for enterprise

  • awesome-spark

    A curated list of awesome Apache Spark packages and resources.

  • LearningSparkV2

    This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]

  • Mobius: C# API for Spark

    C# and F# language binding and extensions to Apache Spark (by microsoft)

  • sparkMeasure

    This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.

  • flintrock

    A command-line tool for launching Apache Spark clusters.

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

  • awesome-kafka

    A list about Apache Kafka

  • sparkle

    Haskell on Apache Spark. (by tweag)

  • PySpark-Boilerplate

    A boilerplate for writing PySpark Jobs

  • sparktorch

    Train and run Pytorch models on Apache Spark.

  • delight

    A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

  • cuelake

    Use SQL to build ELT pipelines on a data lakehouse.

  • scalable-data-science

    Scalable Data Science, course sets in big data Using Apache Spark over databricks and their mathematical, statistical and computational foundations using SageMath.

  • spark

    Performance Observability for Apache Spark (by dataflint)

  • Project mention: Show HN: DataFlint, performance monitoring for Apache Spark | news.ycombinator.com | 2023-12-28
  • dataproc-templates

    Dataproc templates and pipelines for solving simple in-cloud data tasks

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

apache-spark related posts

  • Observations on MLOps–A Fragmented Mosaic of Mismatched Expectations

    1 project | dev.to | 26 Apr 2024
  • Explain me how websites like Dall-E, chatgpt, thispersondoesntexit process the user data so quickly

    1 project | /r/dataengineering | 17 Jun 2023
  • [D] What licensed software do you use for machine learning experimentation tracking?

    1 project | /r/MachineLearning | 11 Jun 2023
  • Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

    1 project | /r/codehunter | 31 May 2023
  • [Q] Is there a tool to keep track of my ML experiments?

    1 project | /r/datascience | 13 May 2023
  • Experience setting up Spark and Hudi on Kubernetes

    2 projects | /r/dataengineering | 15 Apr 2023
  • Remote file access vulnerability in `mlflow server` and `mlflow ui` CLIs

    1 project | /r/LanguageTechnology | 24 Mar 2023

Index

What are some of the best open-source apache-spark projects? This list will help you:

Rank  Project                    Stars
1     MLflow                     17,335
2     SynapseML                  4,970
3     lakeFS                     4,081
4     Spark Notebook             3,147
5     spark-operator             2,613
6     docker-spark               2,011
7     spark (by dotnet)          1,999
8     feathr                     1,931
9     awesome-spark              1,617
10    LearningSparkV2            1,095
11    Mobius: C# API for Spark   937
12    sparkMeasure               642
13    flintrock                  630
14    quinn                      580
15    awesome-kafka              565
16    sparkle                    444
17    PySpark-Boilerplate        391
18    sparktorch                 335
19    delight                    332
20    cuelake                    284
21    scalable-data-science      165
22    spark (by dataflint)       126
23    dataproc-templates         111
