Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality. Learn more →
Top 23 Spark Open-Source Projects
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
Redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
-
Deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
Mage
🧙 The modern replacement for Airflow. Mage is an open-source data pipeline tool for transforming and integrating data. https://github.com/mage-ai/mage-ai
-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
Alluxio (formerly Tachyon)
Alluxio, data orchestration for analytics and machine learning in the cloud
-
Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
-
dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
-
BigDL
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, ModelScope, etc.
-
RoaringBitmap
A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others
-
linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Redash: Connect to data source, easily visualize, dashboard and share your data | news.ycombinator.com | 2024-03-20
References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming
Project mention: Variant in Apache Doris 2.1.0: a new data type 8 times faster than JSON for semi-structured data analysis | dev.to | 2024-03-27As an open-source real-time data warehouse, Apache Doris provides semi-structured data processing capabilities, and the newly-released version 2.1.0 makes a stride in this direction. Before V2.1, Apache Doris stores semi-structured data as JSON files. However, during query execution, the real-time parsing of JSON data leads to high CPU and I/O consumption in addition to high query latency, especially when the dataset is huge and complicated. Moreover, the lack of a pre-defined schema means there is no handle for query optimization.
Project mention: A mage on the Hero’s Journey: a fantasy epic on how a startup rose from the ashes | dev.to | 2023-06-12In the coming years, Mage will create a cooperative experience so that developers can build data pipelines with their team and level up together. After that journey, Mage will go on an epic quest to create the 1st open world community experience in the data universe.
Delta is pretty great, let's you do upserts into tables in DataBricks much easier than without it.
I think the website is here: https://delta.io
I would use H20 if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/
Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.
Any performance benchmark against intel's 'IPEX-LLM'[0] or others?
Recommend checking out https://github.com/tobymao/sqlglot if you are interested in this capability for other SQL dialects
Tools like this are helpful for:
- Rendering SQL in a consistent way, eg for snapshot testing
Project mention: Spark NLP 5.1.0: Introducing state-of-the-art OpenAI Whisper speech-to-text, OpenAI Embeddings and Completion transformers, MPNet text embeddings, ONNX support for E5 text embeddings, new multi-lingual BART Zero-Shot text classification, and much more! | /r/Python | 2023-09-06
I was recently reading about Roaring https://roaringbitmap.org/ which is a highly optimized compressed bitset implementation. I reccomend reading about it if you are interested in this sort of thing. The talk at https://roaringbitmap.org/talks/ is especially good.
Spark related posts
- Data Engineering Zoomcamp Week 6 - using redpanda 1
- Final project part 5
- PyTorch Library for Running LLM on Intel CPU and GPU
- Apache Uniffle: high performance, general purpose remote shuffle service
- Splink: Fast, accurate, scalable probabilistic data linkage
- Apache Arrow DataFusion Comet Spark Accelerator
- A deep dive into the concept and world of Apache Iceberg Catalogs
-
A note from our sponsor - InfluxDB
www.influxdata.com | 19 Apr 2024
Index
What are some of the best open-source Spark projects? This list will help you:
Project | Stars | |
---|---|---|
1 | Apache Spark | 38,249 |
2 | data-science-ipython-notebooks | 26,438 |
3 | Redash | 24,917 |
4 | data-engineering-zoomcamp | 22,343 |
5 | horovod | 13,942 |
6 | Deeplearning4j | 13,408 |
7 | ds-cheatsheets | 12,570 |
8 | doris | 11,272 |
9 | Mage | 6,953 |
10 | delta | 6,847 |
11 | H2O | 6,721 |
12 | Alluxio (formerly Tachyon) | 6,624 |
13 | Zeppelin | 6,261 |
14 | dev-setup | 6,032 |
15 | BigDL | 5,857 |
16 | sqlglot | 5,389 |
17 | SynapseML | 4,964 |
18 | TensorFlowOnSpark | 3,864 |
19 | spark-nlp | 3,667 |
20 | HELK | 3,659 |
21 | RoaringBitmap | 3,372 |
22 | koalas | 3,317 |
23 | linkis | 3,225 |