Machine Learning Pipelines with Spark: Introductory Guide (Part 1)

This page summarizes the projects mentioned and recommended in the original post on dev.to

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • mleap

    MLeap: Deploy ML Pipelines to Production

  • Everything is custom and will take a lot of work, but luckily, you don’t have to do all the work here. In THE second article, you will use MLeap, a library that does the heavy lifting in terms of serializing Spark ML Pipeline for real-time inference and also provides an execution engine for Spark so you can deploy pipelines on non-Spark runtimes.

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Apache Spark is a fast and general open-source engine for large-scale, distributed data processing. Its flexible in-memory framework allows it to handle batch and real-time analytics alongside distributed data processing.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • scikit-learn

    scikit-learn: machine learning in Python

  • The concepts are similar to the Scikit-learn project. They follow Spark’s “ease of use” characteristic giving you one more reason for adoption. You will learn more about these main concepts in this guide.

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  • DataFrames are a Pandas-like, intuitive high-level API for working with data in Spark. It organizes data in a structured and tabular format in rows and columns, similar to a spreadsheet and a relational database management system. If you have worked with Pandas before, you should be familiar with DataFrames.

  • kubernetes

    Production-Grade Container Scheduling and Management

  • Spark works locally on stand-alone clusters and on Hadoop YARN, Apache Mesos, Kubernetes, and other managed Hadoop platforms.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts