Spark

Open-source projects categorized as Spark

Top 23 Spark Open-Source Projects

  • GitHub repo Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Project mention: Technology Advice | reddit.com/r/dataengineering | 2021-11-03

    Have a look at Apache Spark

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Beginner in Python for Data Science | reddit.com/r/learnpython | 2020-12-27

    data science ipython notebooks

  • GitHub repo Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Project mention: How often do you use SQL query tool or service in your daily work? | reddit.com/r/SQL | 2021-11-21

    Regarding the subqueries: try https://tablum.io or https://redash.io, they materialize queried data so you can do a subquery multiple times.

  • GitHub repo Deeplearning4j

    Model import and deployment framework for retraining models (PyTorch, TensorFlow, Keras) and deploying them in JVM microservice environments, mobile devices, IoT, and Apache Spark

    Project mention: Does Java has similar project like this one in C#? (ml, data) | reddit.com/r/java | 2021-05-23

    Also, the website is now redirected to: https://deeplearning4j.konduit.ai/

  • GitHub repo horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Project mention: [D] GPU buying recommendation | reddit.com/r/MachineLearning | 2021-07-17

    If you just want to run TensorFlow or PyTorch for a Jupyter notebook, setting up the environment shouldn't be difficult. I know that AWS has a marketplace of preconfigured images. However, you can go as advanced as setting up a cluster of GPU-equipped nodes and running Horovod (https://github.com/horovod/horovod) to do distributed machine learning. Yes, there's a learning curve, but you cannot acquire this skill set any other way.

  • GitHub repo cube.js

    📊 Cube — Open-Source Analytics API for Building Data Apps

    Project mention: Building an internal dashboard with Retool and Cube | dev.to | 2021-11-16

    I'll be using a hosted Cube deployment on Cube Cloud to get aggregated data from an e-commerce dataset and Retool as the visualization tool to generate a metrics dashboard.

  • GitHub repo Thingsboard

    Open-source IoT Platform - Device management, data collection, processing and visualization.

    Project mention: (esp32+adxl325+sdcard)Storing sensor data in sd card(in txt file) and then sending it to cloud. | reddit.com/r/IOT | 2021-10-16

    ThingsBoard and ThingSpeak are some more user-friendly alternatives that MIGHT help accomplish your goal; I've not used them before.

  • GitHub repo H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  • GitHub repo dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Project mention: MacOS Development workspace 2021 | dev.to | 2021-03-08

    donnemartin - dev setup

  • GitHub repo Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Project mention: How to use IPython in Apache Zeppelin Notebook | dev.to | 2021-07-10

    [1] Apache Zeppelin http://zeppelin.apache.org/ [2] Zeppelin notebooks website http://zeppelin-notebook.com/. [3] Zeppelin notebooks git repo https://github.com/zjffdu/zeppelin-notebook

  • GitHub repo Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  • GitHub repo delta

    An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads. (by delta-io)

    Project mention: Snowflake’s response to Databricks’ TPC-DS post | news.ycombinator.com | 2021-11-12

    >the delta also still keeps partitioning information in the hive metastore, while iceberg keeps it in storage, making it a far superior design.

    Check out https://github.com/delta-io/delta/blob/3ffb30d86c6acda9b59b9... when you get a chance. You don't need hive metastore to query delta tables since all metadata for a Delta table is stored alongside the data

    >they did not include features like optimizing small files

  • GitHub repo BigDL

    Building Large-Scale AI Applications for Distributed Big Data

    Project mention: Machine learning on JVM | reddit.com/r/scala | 2021-04-05

    Intel BigDL for Spark which again is for Spark.

  • GitHub repo Spark Notebook

    Interactive and Reactive Data Science using Scala and Spark.

  • GitHub repo HELK

    The Hunting ELK

    Project mention: Home lab with security monitoring tools? | reddit.com/r/netsecstudents | 2021-09-23

    HELK can help for the SIEM and detection part

  • GitHub repo koalas

    Koalas: pandas API on Apache Spark

    Project mention: Spark vs Pandas | reddit.com/r/dataengineering | 2021-02-18

    If you like excessive use of square brackets... I mean pandas, you might wanna check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation atop Spark.

  • GitHub repo SynapseML

    Simple and Distributed Machine Learning

    Project mention: Microsoft AI Open-Sources ‘SynapseML’ For Developing Scalable Machine Learning Pipelines | reddit.com/r/artificial | 2021-11-18

    Microsoft has announced the release of SynapseML, an open-source library that simplifies and speeds up the creation of machine learning (ML) pipelines. SynapseML can be used for building scalable and intelligent systems to solve various types of challenges, including anomaly detection, computer vision, deep learning, form and face recognition, gradient boosting, microservice orchestration, model interpretability, reinforcement learning and personalization, search and retrieval, speech processing, text analytics, and translation.

  • GitHub repo dpark

    Python clone of Spark, a MapReduce-like framework in Python

  • GitHub repo feast

    Feature Store for Machine Learning

    Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | reddit.com/r/MachineLearning | 2021-04-15

    Github: https://github.com/feast-dev/feast

  • GitHub repo spark-nlp

    State of the Art Natural Language Processing

    Project mention: November 2021 workshops -- please comment about your preferences | reddit.com/r/Clojure | 2021-10-15

  • GitHub repo RoaringBitmap

    A better compressed bitset in Java

    Project mention: Seeking: efficient CL bitsets. | reddit.com/r/lisp | 2021-10-03

    might be able to use one of the roaring bitmap implementations https://roaringbitmap.org/ via ffi, or port one to CL. been using them from clojure via java implementation, great lib.

  • GitHub repo deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Project mention: PySpark - How to get Corrupted Records after Casting | reddit.com/r/dataengineering | 2021-09-28

    Deequ (this is the Scala version but they have PyDeequ also)

  • GitHub repo Quill

    Compile-time Language Integrated Queries for Scala (by getquill)

    Project mention: Fp libraries that target scala 3 exclusively? | reddit.com/r/scala | 2021-11-22

    I know that libraries like Scodec and shapeless were rewritten practically from scratch for Scala 3, taking advantage of the new syntax and internals, as well as protoquill, a Scala 3 implementation of Quill.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-22.

Index

What are some of the best open-source Spark projects? This list will help you:

 #  Project                          Stars
 1  Apache Spark                     31,417
 2  data-science-ipython-notebooks   21,842
 3  Redash                           19,987
 4  Deeplearning4j                   12,250
 5  horovod                          11,881
 6  cube.js                          11,815
 7  Thingsboard                      10,406
 8  H2O                               5,631
 9  dev-setup                         5,566
10  Zeppelin                          5,486
11  Alluxio (formerly Tachyon)        5,335
12  delta                             3,840
13  BigDL                             3,802
14  Spark Notebook                    3,073
15  HELK                              3,071
16  koalas                            3,027
17  SynapseML                         2,757
18  dpark                             2,664
19  feast                             2,490
20  spark-nlp                         2,475
21  RoaringBitmap                     2,427
22  deequ                             1,988
23  Quill                             1,982