Spark open source community is awesome

This page summarizes the projects mentioned and recommended in the original post on /r/apachespark

  • mack

    Delta Lake helper methods in PySpark

  • a couple devs just added a `find_composite_key_candidates` function so users can easily identify columns that could be used as a unique identifier in their Delta table.

  • jodie

    Delta lake and filesystem helper methods (by MrPowers)

  • another dev is working on adding an elegant interface to perform Hadoop filesystem operations, similar to os-lib for regular filesystem operations

  • os-lib

    OS-Lib is a simple, flexible, high-performance Scala interface to common OS filesystem and subprocess APIs

  • another dev is working on adding an elegant interface to perform Hadoop filesystem operations, similar to os-lib for regular filesystem operations

  • chispa

    PySpark test helper methods with beautiful error messages

  • here's a little README fix a user pushed to chispa

  • delta-rs

    A native Rust library for Delta Lake, with bindings into Python

  • Yeah, tons of employees from many different companies have made massive contributions to the Spark ecosystem. Apple built Delta Lake with Databricks; see this video for more detail. Lots of Spark PMC members come from various companies. delta-rs was initially built by Scribd and is now actively maintained by engineers at Voltron Data and other companies. It's awesome that the community has so many contributors from so many sources.
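
The composite-key idea mentioned under mack can be illustrated with a minimal sketch in plain Python. This is not mack's implementation and does not use its API: `find_key_candidates`, its `max_size` parameter, and the sample rows are all hypothetical, and the rows are ordinary dictionaries rather than a Delta table. The sketch tries column combinations from smallest to largest and reports those whose values are unique across every row.

```python
from itertools import combinations


def find_key_candidates(rows, columns, max_size=2):
    """Return the smallest column combinations whose values are unique per row.

    Hypothetical helper for illustration only; not mack's implementation.
    """
    for size in range(1, max_size + 1):
        found = []
        for combo in combinations(columns, size):
            # Project each row onto the candidate columns.
            projected = [tuple(row[c] for c in combo) for row in rows]
            # The combination is a key candidate if no projected value repeats.
            if len(set(projected)) == len(rows):
                found.append(combo)
        if found:
            # Stop at the smallest size that yields at least one candidate.
            return found
    return []


# Made-up sample data, standing in for the rows of a table.
rows = [
    {"country": "US", "city": "Austin", "year": 2022},
    {"country": "US", "city": "Boston", "year": 2022},
    {"country": "CA", "city": "Austin", "year": 2023},
]
candidates = find_key_candidates(rows, ["country", "city", "year"])
print(candidates)
```

No single column is unique in the sample, so the sketch falls through to pairs of columns. On a real table the same check would more likely be done with grouped counts in Spark than by collecting rows into Python.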
