Apache Spark VS Trino

Compare Apache Spark and Trino and see how they differ.

Apache Spark

Apache Spark - A unified analytics engine for large-scale data processing (by apache)

Trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io) (by trinodb)

                 Apache Spark          Trino
Mentions         101                   44
Stars            38,249                9,519
Growth           1.0%                  2.3%
Activity         10.0                  10.0
Latest commit    2 days ago            about 2 hours ago
Language         Scala                 Java
License          Apache License 2.0    Apache License 2.0

Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

Apache Spark

Posts with mentions or reviews of Apache Spark. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2024-03-11.

Trino

Posts with mentions or reviews of Trino. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-11-16.
  • Your Thoughts on OLAPs Clickhouse vs Apache Druid vs Starrocks in 2023/2024
    2 projects | /r/dataengineering | 16 Nov 2023
    DevRel for StarRocks. Trino doesn't have a great caching layer (https://github.com/trinodb/trino/pull/16375) or great performance (https://github.com/trinodb/trino/issues/14237); see also https://github.com/oap-project/Gluten-Trino. In benchmarks and community user testing, StarRocks has outperformed it.
  • Making Hard Things Easy
    11 projects | news.ycombinator.com | 6 Oct 2023
    What if my SQL engine is Presto, Trino [1], or a similar query engine? If it's federating multiple source databases we peel the SQL back and get... SQL? Or you peel the SQL back and get... S3 + Mongo + Hadoop? Junior analysts would work at 1/10th the speed if they had to use those raw.

    [1] https://trino.io/

  • Questions about Athena, Trino and Iceberg
    2 projects | /r/dataengineering | 15 Jun 2023
    The good thing is that the concepts, in terms of the SQL supported by Trino, transfer between them all. So it's completely reasonable to start with one and move to another. In fact, that is something that happens regularly. I invite you to check out the talks from the Trino Fest event that is just wrapping up today. There are presentations about all these aspects and the different scenarios users encounter. All videos and slides will go live on the Trino website soon. Also feel free to join the Trino Slack to chat about all this with other users.
  • Apache Iceberg as storage for on-premise data store (cluster)
    3 projects | /r/dataengineering | 16 Mar 2023
    Trino or Hive for SQL querying. Get Trino/Hive to talk to Nessie.
  • Uber Interview Experience/Asking Suggestions
    4 projects | /r/dataengineering | 1 Feb 2023
    One place to look is the projects' repos and docs. Once you have a good idea of how the system is architected, poking around pieces of the codebase can be helpful in letting you really understand their internals. I personally enjoy going through the Spark repo and the Trino repo, and the documentation for both projects is decent and can answer many of your questions.
  • Java OSS with best code quality you’ve ever seen?
    3 projects | /r/java | 14 Jan 2023
  • What is the separation of storage and compute in data platforms and why does it matter?
    3 projects | dev.to | 29 Nov 2022
    However, once your data reaches a certain size or you reach the limits of vertical scaling, it may be necessary to distribute your queries across a cluster, that is, to scale horizontally. This is where distributed query engines like Trino and Spark come in. Distributed query engines use a coordinator to plan the query and multiple worker nodes to execute it in parallel.
  • Sparkless is born
    2 projects | /r/apachespark | 24 Nov 2022
  • Why use Spark at all?
    2 projects | /r/dataengineering | 19 Oct 2022
  • Data Engineer Github Profile?
    3 projects | /r/dataengineering | 9 Oct 2022
    So, there are a lot of people involved, which is why you try to find some common ground regarding coding style, documentation, programming language version, … Most larger projects have some guidelines you can read to see what is necessary for a PR to be approved: Trino’s Guidelines or dbt.
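
The dev.to excerpt in the list above describes distributed query engines as a coordinator that plans a query plus worker nodes that execute it in parallel. A minimal sketch of that idea follows, using only the Python standard library. It is not how Trino or Spark are implemented: threads stand in for worker nodes, and the names plan, execute_partial and run_query are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(rows, num_workers):
    """Coordinator step: split the input into one partition per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def execute_partial(partition):
    """Worker step: compute a partial aggregate (sum, count) for one partition."""
    return sum(partition), len(partition)


def run_query(rows, num_workers=4):
    """Coordinator: plan, fan out to workers in parallel, merge the partials.

    Roughly what an engine would do for SELECT avg(x) FROM t, with threads
    standing in for worker nodes (an assumption made for this sketch).
    """
    partitions = plan(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(execute_partial, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(run_query(list(range(1, 101))))
```

Each worker returns a (sum, count) pair rather than a local average, because partial sums and counts can be merged exactly while averages of unevenly sized partitions cannot.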

What are some alternatives?

When comparing Apache Spark and Trino you can also consider the following projects:

dremio-oss - Dremio - the missing link in modern data

Presto - The official home of the Presto distributed SQL query engine for big data

Apache Drill - Apache Drill is a distributed MPP query layer for self describing data

Apache Calcite - Apache Calcite

Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration

ClickHouse - ClickHouse® is a free analytics DBMS for big data

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Scalding - A Scala API for Cascading

spring-data-jpa-mongodb-expressions - Use the MongoDB query language to query your relational database, typically from frontend.

mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services

luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing