Apache Spark

Apache Spark - A unified analytics engine for large-scale data processing (by apache)

Apache Spark Alternatives

Similar projects and alternatives to Apache Spark

  1. kubernetes

    Production-Grade Container Scheduling and Management

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  3. PostgreSQL

    Mirror of the official PostgreSQL GIT repository. Note that this is just a *mirror* - we don't work with pull requests on github. To contribute, please see https://wiki.postgresql.org/wiki/Submitting_a_Patch

  4. Pandas

    449 Apache Spark VS Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

  5. Redis

    For developers, who are building real-time data-driven applications, Redis is the preferred, fastest, and most feature-rich cache, data structure server, and document and vector query engine.

  6. MongoDB

    The MongoDB Database

  7. Airflow

    205 Apache Spark VS Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  8. examples

    185 Apache Spark VS examples

    TensorFlow examples (by tensorflow)

  9. ApacheKafka

    A curated resources list for awesome Apache Kafka

  10. Apache Arrow

    Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

  11. delta

    81 Apache Spark VS delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  12. redpanda

    Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!

  13. dagster

    62 Apache Spark VS dagster

    An orchestration platform for the development, production, and observation of data assets.

  14. Trino

    54 Apache Spark VS Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

  15. Apache Kafka

    Apache Kafka - A distributed event streaming platform

  16. Apache Cassandra

    Open source transactional distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure without compromising performance.

  17. Apache Pulsar

    Apache Pulsar - distributed pub-sub messaging system

  18. Apache Hive

    Apache Hive (by apache)

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a better Apache Spark alternative or higher similarity.

Apache Spark discussion

  1. combinatorist
    · almost 2 years ago

    Wonderful if you need to do a lot of complex or high-volume analytics or data pipelines. I recommend going the extra mile and learning Scala, but Python is available for those who prefer it (I wouldn't consider Java or R, but I'm biased).

Apache Spark reviews and mentions

Posts with mentions or reviews of Apache Spark. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2026-06-02.
  • MLOps Lifecycle: Stages, Workflow, and Best Practices
    4 projects | dev.to | 2 Jun 2026
    Feature transformations should be deterministic: The same input should produce the same output when the same feature definition and configuration are applied. This is what allows training, backtesting, and live inference to remain aligned. Tools such as Pandas, Spark, or feature platforms such as Feast can be used to implement that logic.
  • 7 Free Tools for Data Pipeline Reconciliation and Cross-Source Validation
    4 projects | dev.to | 13 May 2026
    Apache Spark provides distributed in-memory data processing and is the appropriate tool when the data set to be reconciled does not fit in a single machine's memory, or when parallelizing the comparison across a cluster would reduce runtime from hours to minutes.
  • Why Apache IoTDB Is Written in Java: A Decade of Engineering Trade-offs
    7 projects | dev.to | 1 Apr 2026
    When IoTDB was initiated in 2011, almost all influential distributed systems and databases were built in Java or on the JVM—such as Hadoop, HBase, Spark (Scala on JVM), Cassandra, Kafka, and Flink. To integrate deeply with the big data ecosystem, choosing Java was a natural decision.
  • Apache Spark VS sail - a user suggested alternative
    2 projects | 18 Mar 2026
  • I Scraped 47M+ Hacker News Items Into Parquet Files – Here's What I Discovered About HN's Hidden Data Patterns
    2 projects | dev.to | 18 Mar 2026
    For handling even larger datasets or building production applications, Apache Spark provides excellent Parquet support with distributed processing capabilities.
  • Add Support for PyCapsule to Pyspark
    1 project | news.ycombinator.com | 28 Jan 2026
  • Pandas 3.0
    4 projects | news.ycombinator.com | 28 Jan 2026
    Funnily enough, I actually just (2 weeks ago) added support for streaming from PySpark to Polars/DuckDB/etc. through Arrow PyCapsule. By streaming, I mean actually streaming, not collecting all data at once. It probably won't be released until May/June, but it's there: https://github.com/apache/spark/commit/ecf179c3485ba8bac72af...
  • Show HN: Spark – Zero-config IoT deployment tool written in Rust
    2 projects | news.ycombinator.com | 8 Jan 2026
    You may want to consider renaming this project.

    The name "Spark" already refers to:

    A popular data analytics framework of the Apache Foundation: https://spark.apache.org/

    A subset of the Ada programming language used for formal verification: https://learn.adacore.com/courses/intro-to-spark/chapters/01...

    An Nvidia AI development system: https://www.nvidia.com/en-us/products/workstations/dgx-spark...

  • 15 AWS EMR Cost Optimization Tips to Slash Your EMR Spending (2025)
    4 projects | dev.to | 16 Dec 2025
    AWS EMR (Elastic MapReduce) is a fully managed big data platform. It manages the setup, configuration, and tuning of open source frameworks like Apache Hadoop, Apache Spark, Apache Hive, Presto, and more at scale on AWS infrastructure. EMR handles cluster scaling, resource allocation, and lifecycle management. This allows you to work with large datasets for various use cases, from ETL pipelines to ML workloads. EMR uses a pay-as-you-go pricing model. Costs for compute, storage, and other AWS services can add up quickly as your data grows, clusters get bigger, and jobs become more complex. If you're not careful, costs can skyrocket due to inefficient resource use, poor instance choices, and misconfigured storage. That's why AWS EMR Cost Optimization is key. It helps you get the best performance per dollar while maintaining data processing speed, reliability, and scalability.
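
The MLOps Lifecycle excerpt in the list above says feature transformations should be deterministic: the same input, feature definition, and configuration should produce the same output. The following is a minimal sketch of that idea in plain Python, not taken from any of the linked posts. The names `FeatureConfig`, `transform`, `stable_bucket`, and `config_fingerprint`, as well as the `amount` and `merchant` fields, are hypothetical and chosen only for illustration. The same logic could presumably be expressed in Pandas or Spark.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FeatureConfig:
    """Hypothetical feature definition; the parameters are illustrative only."""
    amount_cap: float = 1000.0
    n_buckets: int = 16


def stable_bucket(value: str, n_buckets: int) -> int:
    # Use a cryptographic digest rather than the built-in hash(), which is
    # salted per interpreter run for strings and so is not reproducible.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def transform(row: dict, config: FeatureConfig) -> dict:
    # A pure function of (row, config): no clock, randomness, or global state.
    return {
        "amount_capped": min(float(row["amount"]), config.amount_cap),
        "merchant_bucket": stable_bucket(row["merchant"], config.n_buckets),
    }


def config_fingerprint(config: FeatureConfig) -> str:
    # A short, stable identifier for the configuration, so training and
    # inference can check that they applied the same feature definition.
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Storing the fingerprint next to the computed features is one way to detect that training, backtesting, and live inference have drifted apart; whether that is worth doing depends on the pipeline.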

Stats

Basic Apache Spark repo stats
137
43,440
10.0
6 days ago
