Everything is custom and will take a lot of work, but luckily, you don’t have to do all of it yourself. In the second article, you will use MLeap, a library that does the heavy lifting of serializing Spark ML Pipelines for real-time inference and also provides an execution engine, so you can deploy pipelines on non-Spark runtimes.
Apache Spark is a fast and general open-source engine for large-scale, distributed data processing. Its flexible in-memory framework allows it to handle batch and real-time analytics alongside distributed data processing.
The concepts are similar to those of the Scikit-learn project. They follow Spark’s “ease of use” philosophy, which gives you one more reason to adopt it. You will learn more about these main concepts in this guide.
The DataFrame is an intuitive, Pandas-like, high-level API for working with data in Spark. It organizes data in a structured, tabular format of rows and named columns, similar to a spreadsheet or a table in a relational database. If you have worked with Pandas before, DataFrames should feel familiar.
Spark runs locally, on standalone clusters, and on Hadoop YARN, Apache Mesos, and Kubernetes, as well as on managed Hadoop platforms.
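In practice, the deployment target is usually chosen through the master URL given to Spark, for example via `spark-submit --master` or `SparkSession.builder.master(...)`. The sketch below lists the common URL forms. `master_url` is a hypothetical helper, not part of Spark, and the host names and ports are placeholders.

```python
# Master URL templates per cluster manager.
# Hosts and ports are placeholders to be filled in by the caller.
MASTER_URLS = {
    "local": "local[*]",                       # all cores on this machine
    "standalone": "spark://{host}:{port}",     # Spark's built-in cluster manager
    "yarn": "yarn",                            # cluster located via Hadoop config
    "mesos": "mesos://{host}:{port}",
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(manager, host="", port=0):
    """Return the master URL string for the given cluster manager."""
    template = MASTER_URLS[manager]
    # Templates without placeholders (local, yarn) are returned unchanged.
    return template.format(host=host, port=port)


if __name__ == "__main__":
    print(master_url("local"))
    print(master_url("standalone", "spark-master.example.com", 7077))
    print(master_url("kubernetes", "k8s-api.example.com", 443))
```

For YARN no host is given because Spark reads the cluster location from the Hadoop configuration directory on the submitting machine.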