Machine Learning Pipelines with Spark: Introductory Guide (Part 1)

mleap

1 1,494 5.2 Scala

MLeap: Deploy ML Pipelines to Production

Everything is custom and will take a lot of work, but luckily, you don’t have to do it all yourself. In the second article, you will use MLeap, a library that does the heavy lifting of serializing Spark ML Pipelines for real-time inference and also provides an execution engine for those pipelines, so you can deploy them on non-Spark runtimes.

Apache Spark

101 38,249 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark is a fast and general open-source engine for large-scale, distributed data processing. Its flexible in-memory framework allows it to handle batch and real-time analytics alongside distributed data processing.

scikit-learn

81 57,985 9.9 Python

scikit-learn: machine learning in Python

The concepts are similar to those of the Scikit-learn project. They follow Spark’s “ease of use” characteristic, giving you one more reason for adoption. You will learn more about these main concepts in this guide.
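The shared idea can be sketched in plain Python. This is a toy illustration of the fit/transform pattern, not Spark's or scikit-learn's actual API: the class names `Scaler`, `ScalerModel`, `Clip`, `Pipeline`, and `PipelineModel` are hypothetical and chosen for this example only. It assumes the Spark ML convention in which fitting an estimator returns a separate fitted model (in scikit-learn, `fit` returns the estimator itself).

```python
from statistics import mean, pstdev


class ScalerModel:
    """Fitted transformer: standardizes values with stored statistics."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]


class Scaler:
    """Estimator: fit() learns statistics and returns a fitted transformer."""

    def fit(self, values):
        # Fall back to 1.0 so constant input does not divide by zero.
        return ScalerModel(mean(values), pstdev(values) or 1.0)


class Clip:
    """Stateless transformer: needs no fitting."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def transform(self, values):
        return [min(max(v, self.lo), self.hi) for v in values]


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class Pipeline:
    """Chains stages; fit() fits estimators and passes data downstream."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return PipelineModel(fitted)


model = Pipeline([Scaler(), Clip(-1.0, 1.0)]).fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))
```

The point of the pattern is that the whole chain is fitted once and then reused as a single object on new data, which is what makes a pipeline convenient to save and deploy.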

Pandas

393 41,863 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

DataFrames are a Pandas-like, intuitive, high-level API for working with data in Spark. A DataFrame organizes data in a structured, tabular format of rows and columns, similar to a spreadsheet or a table in a relational database management system. If you have worked with Pandas before, DataFrames should feel familiar.

kubernetes

656 106,611 10.0 Go

Production-Grade Container Scheduling and Management

Spark works locally, on stand-alone clusters, and on Hadoop YARN, Apache Mesos, Kubernetes, and other managed Hadoop platforms.
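In practice, the deployment target is selected through the Spark master URL. The helper below is a small sketch of how those URLs are formed; the function name `spark_master_url`, its parameters, and the example host are hypothetical, while the URL schemes and default ports follow the formats described in the Spark documentation.

```python
def spark_master_url(target, host=None, port=None):
    """Build a Spark master URL string for a deployment target.

    `target` is one of: "local", "yarn", "standalone", "mesos", "kubernetes".
    """
    if target == "local":
        # Run in-process, using as many worker threads as logical cores.
        return "local[*]"
    if target == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"

    # (URL prefix, conventional default port) for host-based cluster managers.
    schemes = {
        "standalone": ("spark://", 7077),
        "mesos": ("mesos://", 5050),
        "kubernetes": ("k8s://https://", 443),
    }
    if target not in schemes:
        raise ValueError(f"unknown deployment target: {target!r}")
    if host is None:
        raise ValueError(f"target {target!r} requires a host")

    prefix, default_port = schemes[target]
    return f"{prefix}{host}:{port or default_port}"


for name in ("local", "yarn"):
    print(name, "->", spark_master_url(name))
for name in ("standalone", "mesos", "kubernetes"):
    print(name, "->", spark_master_url(name, host="cluster.example.com"))
```

The resulting string is what you would pass to `SparkSession.builder.master(...)` in code or to the `--master` option of `spark-submit`.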

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

How to Build and Deploy a Machine Learning model using Docker
5 projects | dev.to | 30 Jul 2023
Build A Covid-19 EDA & Viz App Using Streamlit
3 projects | dev.to | 5 Dec 2022
Talking Data: What do we need for engaging data analytics?
4 projects | dev.to | 6 Oct 2022
A peek into Location Data Science at Ola
6 projects | dev.to | 26 Sep 2022
Can anyone share some good examples of Python OOP Repos for DS?
4 projects | /r/datascience | 17 Sep 2022

This page summarizes the projects mentioned and recommended in the original post on dev.to
Tags: Python, Data Analysis, Science and Data analysis, Machine Learning, MapReduce
Post date: 23 Oct 2022
