Top 17 Jupyter Notebook Spark Projects
-
H2O
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
-
JustEnoughScalaForSpark
A tutorial on the most important features and idioms of Scala that you need to use Spark's Scala APIs.
-
lasagna
A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino, and Kafka.
-
amazon-emr-with-delta-lake
Amazon EMR Notebook to show how to read from and write to Delta tables with Amazon EMR
-
pyspark_nlp_workshop
Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"
-
project-atlas-sao-paulo
A project for the development of rich geospatial data from the city of São Paulo for use in Machine Learning models.
-
workshop-introduction-to-machine-learning
Come ready to discover the goals and approaches of machine learning, and how to build effective algorithms and solutions!
References: Data engineering zoomcamp week 6 course and homework notes: https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/cohorts/2024/06-streaming
I would use H2O if I were you. You can try out LLMs with a nice GUI. Unless you have some familiarity with the tools needed to run these projects, it can be frustrating. https://h2o.ai/
Project mention: PySpark for NLP workshop – Jupyter notebooks and instructions | news.ycombinator.com | 2023-05-14
You can find the code related to this project in my GitHub repository.
Jupyter Notebook Spark related posts
- Data Engineering Zoomcamp Week 6 - using redpanda 1
- Final project part 5
- Building a project in DBT
- Testing and documenting DBT models
- Extracting data with dlt
- Data engineering at home?
- Rockstar Data Engineers making big bucks: what are you doing exactly?
Index
What are some of the best open-source Spark projects in Jupyter Notebook? This list will help you:
# | Project | Stars |
---|---|---|
1 | data-engineering-zoomcamp | 22,446 |
2 | H2O | 6,730 |
3 | HELK | 3,659 |
4 | JustEnoughScalaForSpark | 673 |
5 | Data-Engineering-Projects | 637 |
6 | ngods-stocks | 354 |
7 | rikai | 135 |
8 | lasagna | 27 |
9 | SDE | 22 |
10 | ghcn-d | 21 |
11 | amazon-emr-with-delta-lake | 17 |
12 | pyspark_nlp_workshop | 12 |
13 | project-atlas-sao-paulo | 9 |
14 | workshop-introduction-to-machine-learning | 7 |
15 | synapse-azure-data-explorer-101 | 4 |
16 | udacity_bike_share_datalake_project | 0 |
17 | dracula | 0 |