Apache Spark - A unified analytics engine for large-scale data processing
Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require that you have Spark set up and can execute Python code using the PySpark library. They also assume that your data lives in Amazon S3 (Simple Storage Service). All of this runs on AWS EMR (Elastic MapReduce).
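As a minimal sketch of that setup, the snippet below starts a SparkSession and reads a CSV file from S3 into a DataFrame; the bucket name, path, and file format are placeholders, and it assumes an EMR cluster where the S3 connector (EMRFS) is already configured.

```python
# Minimal sketch: read data from S3 with PySpark on EMR.
# The bucket and key below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read-demo").getOrCreate()

# EMRFS lets Spark address S3 objects directly via the s3:// scheme.
df = spark.read.csv("s3://my-bucket/path/to/data.csv",
                    header=True, inferSchema=True)
df.printSchema()
df.show(5)
```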
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
We’ve learned a lot while setting up Spark on AWS EMR. While this post will focus on how to use PySpark with Pandas, let us know in the comments if you’re interested in a future article on how we set up Spark on AWS EMR.
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Pandas user-defined functions (UDFs) are built on top of Apache Arrow. They improve performance by letting developers scale their workloads and use the pandas API inside Apache Spark: the function body works with pandas objects, while Apache Arrow handles the data exchange between Spark and Python.
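Here is a minimal sketch of a Pandas UDF, assuming a running SparkSession; the column names and sample values are made up for illustration. Arrow ships each batch of the Spark column to Python as a pandas Series, the function runs ordinary pandas code on it, and the result is returned to Spark through Arrow.

```python
# Minimal sketch of a vectorized (Pandas) UDF in PySpark.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf("double")
def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    # Plain pandas operations run on each Arrow-backed batch.
    return (temps - 32) * 5.0 / 9.0

# Hypothetical example data.
df = spark.createDataFrame([(32.0,), (98.6,), (212.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```

Compared to a row-at-a-time Python UDF, the whole batch stays vectorized, which is where the performance benefit of the Arrow-based exchange comes from.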
How to use Spark and Pandas to prepare big data
3 projects | dev.to | 21 Sep 2021
Arrow v1.0: After 8 years, a new milestone with a lot of new features
3 projects | news.ycombinator.com | 26 Feb 2021
Returning self in Python for chaining
2 projects | reddit.com/r/Python | 15 Jun 2022
How to create a library for wireless communication?
2 projects | reddit.com/r/learnpython | 27 May 2022
3 Things To Know Before Building with PyScript
2 projects | dev.to | 26 May 2022