Open-source projects categorized as Spark

Top 23 Spark Open-Source Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: is anyone want to join maintaining spark java framework? | 2022-06-21

    Wow, this has nothing to do with Apache Spark, the wildly popular JVM-based data processing framework.

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: Show HN: DataStation – App to easily query, script, and visualize data | 2022-05-31

    How does it compare to Redash (now Databricks SQL):

  • cube.js

    📊 Cube — Headless Business Intelligence for Building Data Applications

    Project mention: Parsing logs from multiple data sources with Ahana and Cube | 2022-06-08

    const measureNames = [
      'perc_10', 'perc_10_value',
      'perc_20', 'perc_20_value',
      'perc_30', 'perc_30_value',
      'perc_40', 'perc_40_value',
      'perc_50', 'perc_50_value',
      'perc_60', 'perc_60_value',
      'perc_70', 'perc_70_value',
      'perc_80', 'perc_80_value',
      'perc_90', 'perc_90_value',
      'perc_100', 'perc_100_value',
    ];

    const measures = Object.keys(measureNames).reduce((result, name) => {
      const sqlName = measureNames[name];
      return {
        ...result,
        [sqlName]: { sql: () => sqlName, type: `max` }
      };
    }, {});

    cube('errorpercentiles', {
      sql: `with sagemaker as (
          select
            model_name,
            variant_name,
            cast(json_extract(FROM_UTF8( from_base64(, '$.correlation_id') as varchar) as correlation_id,
            cast(json_extract(FROM_UTF8( from_base64(, '$.prediction') as double) as prediction
          from s3.sagemaker_logs.logs
        )
        , actual as (
          select correlation_id, actual_value
          from postgresql.public.actual_values
        )
        , logs as (
          select
            model_name,
            variant_name as model_variant,
            sagemaker.correlation_id,
            prediction,
            actual_value as actual
          from sagemaker
          left outer join actual
            on sagemaker.correlation_id = actual.correlation_id
        )
        , errors as (
          select abs(prediction - actual) as abs_err, model_name, model_variant
          from logs
        ),
        percentiles as (
          select
            approx_percentile(abs_err, 0.1) as perc_10,
            approx_percentile(abs_err, 0.2) as perc_20,
            approx_percentile(abs_err, 0.3) as perc_30,
            approx_percentile(abs_err, 0.4) as perc_40,
            approx_percentile(abs_err, 0.5) as perc_50,
            approx_percentile(abs_err, 0.6) as perc_60,
            approx_percentile(abs_err, 0.7) as perc_70,
            approx_percentile(abs_err, 0.8) as perc_80,
            approx_percentile(abs_err, 0.9) as perc_90,
            approx_percentile(abs_err, 1.0) as perc_100,
            model_name,
            model_variant
          from errors
          group by model_name, model_variant
        )
        SELECT
          count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
          , max(perc_10) as perc_10_value
          , count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
          , max(perc_20) as perc_20_value
          , count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
          , max(perc_30) as perc_30_value
          , count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
          , max(perc_40) as perc_40_value
          , count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
          , max(perc_50) as perc_50_value
          , count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
          , max(perc_60) as perc_60_value
          , count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
          , max(perc_70) as perc_70_value
          , count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
          , max(perc_80) as perc_80_value
          , count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
          , max(perc_90) as perc_90_value
          , count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
          , max(perc_100) as perc_100_value
          , p.model_name, p.model_variant
        FROM percentiles p, errors e
        group by p.model_name, p.model_variant`,

      preAggregations: {
        // Pre-Aggregations definitions go here
        // Learn more here:
      },

      joins: {
      },

      measures: measures,

      dimensions: {
        modelVariant: {
          sql: `model_variant`,
          type: 'string'
        },
        modelName: {
          sql: `model_name`,
          type: 'string'
        },
      }
    });
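
The `measures` step in the snippet above can be tried in isolation. What follows is a minimal sketch in plain JavaScript with no Cube dependency. The helper names `buildMeasureNames` and `buildMeasures` are hypothetical and appear neither in the original post nor in the Cube API. It assumes the same naming scheme (`perc_N` and `perc_N_value` in steps of 10) and the same measure shape (`{ sql, type: 'max' }`) as the quoted schema.

```javascript
// Hypothetical helper (not from the original post): generate the measure
// names instead of listing all twenty by hand.
function buildMeasureNames(step = 10) {
  const names = [];
  for (let p = step; p <= 100; p += step) {
    names.push(`perc_${p}`, `perc_${p}_value`);
  }
  return names;
}

// Hypothetical helper: build an object with one entry per SQL column,
// in the same shape the quoted schema passes as `measures`.
function buildMeasures(names) {
  return names.reduce(
    (result, sqlName) => ({
      ...result,
      [sqlName]: { sql: () => sqlName, type: 'max' },
    }),
    {}
  );
}

const measureNames = buildMeasureNames();
const measures = buildMeasures(measureNames);

console.log(Object.keys(measures).length);
console.log(measures.perc_50.sql(), measures.perc_50.type);
```

The quoted code reduces over `Object.keys(measureNames)`, which for an array yields index strings, and then looks each name up by index. Reducing over the array itself, as in this sketch, produces the same object with one less indirection.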

  • horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: Anyone know of any papers or models for segmenting satellite images of a city into things like roads, buildings, parks, etc? | 2022-04-25

    Training is not the same as inference (doing the segmentation), so that scale is probably off by a lot. One or two orders of magnitude just depending on the specifics of what hardware you're running on, and your training and eval dataset would be several orders of magnitude smaller. FAANGs would parallelize that training as well (don't remember if UNet is inherently parallelizable for training) via their internal equivalent of Horovod, so they'll do a GPU-month worth of training in less than a day.

  • Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular and tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

    Project mention: Data Science Competition | 2022-03-25

  • ds-cheatsheets

    List of Data Science Cheatsheets to rule the world

    Project mention: ⚙️ Data Science Cheat Sheets: A collection of cheat sheets for #DataScience and problem solving. h/t @Sauain | 2021-10-01

  • H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

    Project mention: A Tiny Grammar of Graphics | 2022-06-14

  • dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: Automate Mac setup? | 2022-04-10

    Something like this at least is the most direct answer to your question, as opposed to "you're doing it wrong" which unfortunately seems to be more upvoted. An example of something like this might be

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: Visualization using Pyspark Dataframe | 2022-05-14

    Have you tried Apache Zeppelin? I remember that you can pretty-print Spark DataFrames directly in it with

  • Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python. (by delta-io)

    Project mention: Databricks platform for small data, is it worth it? | 2022-06-29

    Currently the infrastructure we have is some custom made pipelines that load the data on S3, and I use Delta Tables here and there for its convenience: ACID, time travel, merges, CDC etc...

  • BigDL

    Building Large-Scale AI Applications for Distributed Big Data

  • TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

    Project mention: [D]Speed up inference on Spark | 2022-02-18

    Currently I use the TensorFlowOnSpark framework to train models and run predictions. At prediction time, I have billions of samples to predict, which is time-consuming. I wonder if there are any good practices for this.

  • SynapseML

    Simple and Distributed Machine Learning

    Project mention: [P] Microsoft releases SynapseML v0.9.5 with support for speech synthesis, anomaly detection, and geospatial analytics on large-scale data | 2022-03-08

    Link to Release Notes:

  • HELK

    The Hunting ELK

    Project mention: Build a SOC LAB | 2022-04-12

  • koalas

    Koalas: pandas API on Apache Spark

  • Spark Notebook

    Interactive and Reactive Data Science using Scala and Spark.

  • spark-nlp

    State of the Art Natural Language Processing

    Project mention: Spark-NLP 4.0.0 🚀: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ SOTA models | 2022-06-15

  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • RoaringBitmap

    A better compressed bitset in Java

    Project mention: Roaring bitmaps: A better compressed bitset | 2022-05-19

  • deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: Soda Core (OSS) is now GA! So, why should you add checks to your data pipelines? | 2022-06-28

    GE is arguably the most well known OSS alternative to Soda Core. The third option is deequ, originally developed and released in OSS by AWS. Our community has told us that Soda Core is different because it’s easy to get going and embed into data pipelines. And it also allows some of the check authoring work to be moved to other members of the data team. I'm sure there are also scenarios where Soda Core is not the best option. For example, when you only use Pandas dataframes or develop in Scala.

  • Quill

    Compile-time Language Integrated Queries for Scala

    Project mention: Query DSL in Scala ? | 2022-02-24

    I think Quill is the closest to your request:

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-06-29.

What are some of the best open-source Spark projects? This list will help you:

 #  Project                          Stars
 1  Apache Spark                    33,221
 2  data-science-ipython-notebooks  23,245
 3  Redash                          21,321
 4  cube.js                         13,251
 5  horovod                         12,530
 6  Deeplearning4j                  12,509
 7  ds-cheatsheets                  10,649
 8  H2O                              5,869
 9  dev-setup                        5,754
10  Zeppelin                         5,717
11  Alluxio (formerly Tachyon)       5,700
12  delta                            4,473
13  BigDL                            3,959
14  TensorFlowOnSpark                3,799
15  SynapseML                        3,340
16  HELK                             3,231
17  koalas                           3,143
18  Spark Notebook                   3,124
19  spark-nlp                        2,785
20  dpark                            2,679
21  RoaringBitmap                    2,653
22  deequ                            2,302
23  Quill                            2,059