How to use Spark and Pandas to prepare big data

This page summarizes the projects mentioned and recommended in the original post on dev.to

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require that you have Spark set up and can execute Python code using the PySpark library. The examples also require that you have your data in Amazon S3 (Simple Storage Service). All this is set up on AWS EMR (Elastic MapReduce).

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  • We’ve learned a lot while setting up Spark on AWS EMR. While this post will focus on how to use PySpark with Pandas, let us know in the comments if you’re interested in a future article on how we set up Spark on AWS EMR.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

  • Pandas user-defined function (UDF) is built on top of Apache Arrow. Pandas UDF improves data performance by allowing developers to scale their workloads and leverage Panda’s APIs in Apache Spark. Pandas UDF works with Pandas APIs inside the function, and works with Apache Arrow to exchange data.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • How to use Spark and Pandas to prepare big data

    3 projects | dev.to | 21 Sep 2021
  • Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide

    5 projects | dev.to | 20 Aug 2023
  • Machine Learning Pipelines with Spark: Introductory Guide (Part 1)

    5 projects | dev.to | 23 Oct 2022
  • Arrow v1.0: After 8 years, a new milestone with a lot of new features

    3 projects | news.ycombinator.com | 26 Feb 2021
  • Deploying a Serverless Dash App with AWS SAM and Lambda

    3 projects | dev.to | 4 Mar 2024