Spark

Open-source projects categorized as Spark

Top 23 Spark Open-Source Projects

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Project mention: "xAI will open source Grok" | news.ycombinator.com | 2024-03-11
  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • Redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

  • Project mention: Redash: Connect to data source, easily visualize, dashboard and share your data | news.ycombinator.com | 2024-03-20
  • data-engineering-zoomcamp

    Free Data Engineering course!

  • Project mention: Data Engineering Zoomcamp Week 6 - using redpanda 1 | dev.to | 2024-04-09

    References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming

  • horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

  • Deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.

  • Project mention: Deeplearning4j Suite Overview | news.ycombinator.com | 2024-03-29
  • ds-cheatsheets

    List of Data Science Cheatsheets to rule the world

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

  • Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27

    As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly-released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stores semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.

  • Mage

    🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai

  • Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12

    In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Project mention: Delta Lake vs. Parquet: A Comparison | news.ycombinator.com | 2024-01-19

    Delta is pretty great, let's you do upserts into tables in DataBricks much easier than without it.

    I think the website is here: https://delta.io

  • H2O

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

  • Project mention: Really struggling with open source models | /r/LocalLLaMA | 2023-07-12

    I would use H20 if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/

  • Alluxio (formerly Tachyon)

    Alluxio, data orchestration for analytics and machine learning in the cloud

  • Zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

  • Project mention: Serverless Apache Zeppelin on AWS | dev.to | 2024-02-04

    Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.

  • dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

  • BigDL

    Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, ModelScope, etc.

  • Project mention: LLaMA Now Goes Faster on CPUs | news.ycombinator.com | 2024-03-31

    Any performance benchmark against intel's 'IPEX-LLM'[0] or others?

    [0] - https://github.com/intel-analytics/ipex-llm

  • sqlglot

    Python SQL Parser and Transpiler

  • Project mention: Transpile Any SQL to PostgreSQL Dialect | news.ycombinator.com | 2024-03-18

    Recommend checking out https://github.com/tobymao/sqlglot if you are interested in this capability for other SQL dialects

    Tools like this are helpful for:

    - Rendering SQL in a consistent way, eg for snapshot testing

  • SynapseML

    Simple and Distributed Machine Learning

  • Project mention: FLaNK Stack Weekly for 12 September 2023 | dev.to | 2023-09-12
  • TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

  • spark-nlp

    State of the Art Natural Language Processing

  • Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
  • HELK

    The Hunting ELK

  • RoaringBitmap

    A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

  • Project mention: Iterating over Bit Sets Quickly | news.ycombinator.com | 2024-02-24

    I was recently reading about Roaring https://roaringbitmap.org/ which is a highly optimized compressed bitset implementation. I reccomend reading about it if you are interested in this sort of thing. The talk at https://roaringbitmap.org/talks/ is especially good.

  • koalas

    Koalas: pandas API on Apache Spark

  • linkis

    Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-09.

Spark related posts

Index

What are some of the best open-source Spark projects? This list will help you:

Project Stars
1 Apache Spark 38,249
2 data-science-ipython-notebooks 26,438
3 Redash 24,917
4 data-engineering-zoomcamp 22,343
5 horovod 13,942
6 Deeplearning4j 13,408
7 ds-cheatsheets 12,570
8 doris 11,272
9 Mage 6,953
10 delta 6,847
11 H2O 6,721
12 Alluxio (formerly Tachyon) 6,624
13 Zeppelin 6,261
14 dev-setup 6,032
15 BigDL 5,857
16 sqlglot 5,389
17 SynapseML 4,964
18 TensorFlowOnSpark 3,864
19 spark-nlp 3,667
20 HELK 3,659
21 RoaringBitmap 3,372
22 koalas 3,317
23 linkis 3,225
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com