Pyspark now provides a native Pandas API

This page summarizes the projects mentioned and recommended in the original post on /r/Python

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

    There's dask-sql, but I think it is being abandoned for fugue-project. I'm actually excited for this project as it is trying to provide a backend agnostic solution, which would seem like a difficult, lofty goal. I wish them luck.

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

    Pandas syntax is far inferior to regular PySpark in my opinion. Goes to show how much data analysts value a syntax that they're already familiar with. Pandas syntax makes it harder to reason about queries, abstract DataFrame transformations, etc. I've authored some popular PySpark libraries like quinn and chispa and am not excited to add Pandas syntax support, haha.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

  • chispa

    PySpark test helper methods with beautiful error messages

    Pandas syntax is far inferior to regular PySpark in my opinion. Goes to show how much data analysts value a syntax that they're already familiar with. Pandas syntax makes it harder to reason about queries, abstract DataFrame transformations, etc. I've authored some popular PySpark libraries like quinn and chispa and am not excited to add Pandas syntax support, haha.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts