Apache Spark

Apache Spark - A unified analytics engine for large-scale data processing (by apache)

Apache Spark Alternatives

Similar projects and alternatives to Apache Spark

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number indicates a more frequently suggested Apache Spark alternative or greater similarity.

Apache Spark reviews and mentions

Posts with mentions or reviews of Apache Spark. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-11-29.
  • What is the separation of storage and compute in data platforms and why does it matter?
    3 projects | dev.to | 29 Nov 2022
    However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, i.e. scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines use a coordinator to plan the query and multiple worker nodes to execute it in parallel.
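    The coordinator/worker split described above can be sketched in a few lines of plain Python. This is a conceptual toy, not the Trino or Spark API: the coordinator fans a filter predicate out over data partitions, hypothetical workers each scan their partition, and the coordinator merges the partial results.

    ```python
    # Conceptual sketch of a distributed query engine's scatter-gather
    # pattern (NOT real Trino/Spark code): a coordinator plans the work,
    # workers execute it on their partition, results are merged.
    from concurrent.futures import ThreadPoolExecutor


    def worker(partition, predicate):
        # Each worker applies the filter locally to its own partition.
        return [row for row in partition if predicate(row)]


    def coordinator(partitions, predicate):
        # The coordinator fans the query out to one worker per partition
        # and concatenates the partial results.
        with ThreadPoolExecutor() as pool:
            parts = pool.map(lambda p: worker(p, predicate), partitions)
        return [row for part in parts for row in part]


    # Three "nodes", each holding one partition of the data.
    partitions = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
    result = coordinator(partitions, lambda x: x > 5)
    ```

    Real engines add a lot on top of this (query optimization, shuffles, fault tolerance), but the plan-then-execute-in-parallel shape is the same.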
  • Deequ for generating data quality reports
    3 projects | dev.to | 24 Nov 2022
    aws documentation — Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse.
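    To make "declare constraints instead of implementing checks" concrete, here is a hypothetical plain-Python sketch of a Deequ-style completeness check. The real Deequ API is Scala running on Spark DataFrames (with PyDeequ as a Python binding); the function and threshold below are illustrative assumptions only.

    ```python
    # Hypothetical sketch of a Deequ-style data quality metric
    # (NOT the Deequ API): completeness = fraction of non-null values
    # in a column, checked against a declared constraint threshold.
    def check_completeness(rows, column):
        # Count rows where the column is populated.
        non_null = sum(1 for r in rows if r.get(column) is not None)
        return non_null / len(rows)


    rows = [
        {"id": 1, "name": "alice"},
        {"id": 2, "name": None},
        {"id": 3, "name": "carol"},
    ]

    completeness = check_completeness(rows, "name")  # 2 of 3 rows populated
    passed = completeness >= 0.9  # declared constraint: >= 90% complete
    ```

    Deequ computes metrics like this at Spark scale and can additionally suggest constraints and track metric drift across dataset versions.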
  • In One Minute : Hadoop
    10 projects | dev.to | 21 Nov 2022
    Spark, a fast and general engine for large-scale data processing.
  • Machine Learning Pipelines with Spark: Introductory Guide (Part 1)
    5 projects | dev.to | 23 Oct 2022
    Apache Spark is a fast and general open-source engine for large-scale, distributed data processing. Its flexible in-memory framework allows it to handle batch and real-time analytics alongside distributed data processing.
  • A peek into Location Data Science at Ola
    6 projects | dev.to | 26 Sep 2022
    This requires distributed computation tools such as Spark, Hadoop, Flink, and Kafka. For occasional experimentation, however, Pandas, GeoPandas, and Dask are some of the commonly used tools.
  • System Design: Uber
    4 projects | dev.to | 21 Sep 2022
    Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on it using Apache Spark, an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to enrich our data with more data points.
  • System Design: Twitter
    5 projects | dev.to | 21 Sep 2022
    Recording analytics and metrics is one of our extended requirements. As we will be using Apache Kafka to publish all sorts of events, we can process these events and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing.
  • How the world caught up with Apache Cassandra
    4 projects | dev.to | 15 Sep 2022
    Cassandra survived its adolescent years by retaining its position as the database that scales more reliably than anything else, with a continual pursuit of operational simplicity at scale. It demonstrated its value even further by integrating with a broader data infrastructure stack of open source components, including the analytics engine Apache Spark, stream-processing platform Apache Kafka, and others.
  • Why we don’t use Spark
    2 projects | dev.to | 7 Sep 2022
    Most people working in big data know Spark (if you don't, check out their website) as the standard tool to Extract, Transform & Load (ETL) their heaps of data. Spark, the successor of Hadoop & MapReduce, works a lot like Pandas, a data science package where you run operators over collections of data. These operators then return new data collections, which allows the chaining of operators in a functional way while keeping scalability in mind.
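    The chaining of operators the post describes can be sketched with a minimal collection type: each operator returns a new collection, so calls compose functionally, much like Pandas method chaining or Spark transformations. The `Dataset` class below is an illustrative toy, not Spark's API.

    ```python
    # Minimal sketch of functional operator chaining (NOT Spark's API):
    # each operator returns a NEW Dataset, so operations compose.
    class Dataset:
        def __init__(self, rows):
            self.rows = list(rows)

        def filter(self, predicate):
            # Returns a new collection; the original is untouched.
            return Dataset(r for r in self.rows if predicate(r))

        def map(self, fn):
            return Dataset(fn(r) for r in self.rows)

        def collect(self):
            return self.rows


    # Chain operators exactly as the paragraph describes.
    out = (
        Dataset(range(6))
        .filter(lambda x: x % 2 == 0)
        .map(lambda x: x * 10)
        .collect()
    )
    ```

    In Spark the same style applies, with the difference that transformations are lazy and distributed: nothing executes until an action like `collect()` forces the chained plan to run across the cluster.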
  • Tracking Aircraft in Real-Time With Open Source
    17 projects | dev.to | 1 Sep 2022
    Apache Spark

