Apache Spark
Trino
| | Apache Spark | Trino |
|---|---|---|
| Mentions | 48 | 13 |
| Stars | 32,903 | 5,434 |
| Growth | 1.6% | 5.7% |
| Activity | 10.0 | 10.0 |
| Latest commit | 6 days ago | 4 days ago |
| Language | Scala | Java |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Apache Spark
-
What do I need to know about distributed algorithms and systems?
You generally want to keep your data in memory, rather than disk, to keep things reasonably fast. A system like Apache Spark tries to do this for you, spilling to disk when needed. In general, I'd recommend researching Spark, since it will cover a lot of the concepts you care about.
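As a minimal illustration of the memory-first, spill-to-disk behavior described above, here is a hedged PySpark sketch; the parquet path is a hypothetical placeholder:

```python
# A minimal sketch, assuming a local Spark install; the path is hypothetical.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.parquet("/data/events.parquet")  # hypothetical dataset

# MEMORY_AND_DISK keeps partitions in RAM and spills them to disk only when
# they no longer fit -- the behavior described above.
df.persist(StorageLevel.MEMORY_AND_DISK)

print(df.count())  # the first action materializes the cache
```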
-
How to use Spark and Pandas to prepare big data
Apache Spark is one of the most actively developed open-source projects in big data. The following code examples require that you have Spark set up and can execute Python code using the PySpark library. The examples also require that you have your data in Amazon S3 (Simple Storage Service). All this is set up on AWS EMR (Elastic MapReduce).
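A minimal sketch of that setup, assuming a PySpark-enabled EMR cluster; the S3 bucket, key, and column name are hypothetical placeholders:

```python
# Minimal sketch, assuming an EMR cluster with PySpark available and data in S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prep-big-data").getOrCreate()

# EMR ships with an S3 connector, so s3:// URIs resolve out of the box.
df = spark.read.csv("s3://my-bucket/raw/events.csv", header=True, inferSchema=True)

# Do the heavy reduction in Spark, then hand the small result to Pandas.
summary = df.groupBy("event_type").count()  # "event_type" is hypothetical
summary_pd = summary.toPandas()
print(summary_pd.head())
```

The pattern to note is doing the reduction on the Spark side and only calling toPandas() on a result small enough to fit on the driver.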
-
AWS Glue: what is it and how does it work?
With Glue, Apache Spark runs in the background. But if this is the first time you’ve heard of the popular open-source analytics engine, it may take you a while to familiarize yourself with the cloud software.
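For orientation, a hedged sketch of what a minimal Glue job script can look like; the Data Catalog database and table names are hypothetical placeholders:

```python
# Hedged sketch of a minimal Glue job; database/table names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Glue's DynamicFrame wraps a Spark DataFrame with schema-flexible semantics.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics",     # hypothetical Glue Data Catalog database
    table_name="raw_events",  # hypothetical table
)
print(dyf.count())

job.commit()
```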
-
Real-time Open Source Indexes: Databases, Headless CMSs and Static Site Generators
Spark SQL (302 active contributors).
-
Top Responsibilities of a Data Engineering Manager
What’s more, the technology landscape is always evolving. New tools come out all the time, often with different functionality than existing tools, so it’s important to stay up to date on what technologies are available and their latest features. For example, four years ago Apache Spark was relatively unknown, but today it is quickly becoming the de facto standard for stream processing.
-
Apache Spark, Hive, and Spring Boot — Testing Guide
In this article, I'm showing you how to create a Spring Boot app that loads data from Apache Hive via Apache Spark to the Aerospike Database. More than that, I'm giving you a recipe for writing integration tests for such scenarios that can be run either locally or during the CI pipeline execution. The code examples are taken from this repository.
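The article's code is Java/Spring Boot; as a rough Python analogue of the "runnable locally or in CI" idea, here is a hedged pytest sketch that spins up a local Spark session instead of a cluster:

```python
# A sketch of the local-testing idea, assuming pytest and PySpark; the
# article's actual implementation is Java/Spring Boot, not this code.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[*] runs Spark inside the test process, no cluster needed, which
    # is what makes the same test runnable both locally and in a CI pipeline.
    session = (
        SparkSession.builder
        .master("local[*]")
        .appName("integration-test")
        .getOrCreate()
    )
    yield session
    session.stop()

def test_load_transform(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.filter(df.id > 1).count() == 1
```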
-
Cannot find col function in pyspark
I can do from pyspark.sql.functions import col, but when I look it up in the GitHub source I find no col function in the functions.py file. How can Python import a function that doesn't exist?
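The short answer is that, at least in older PySpark releases, col and its siblings were generated at import time: a loop injects wrappers into the module's globals, so no literal def col appears in functions.py. A simplified, self-contained sketch of that pattern (the stub body is illustrative, not PySpark's real JVM-backed implementation):

```python
# Simplified sketch of the pattern older pyspark.sql.functions used: function
# names live in a dict, and wrappers are injected into globals() at import
# time, so no literal `def col(...)` ever appears in the source file.
_functions = {
    "col": "Returns a Column based on the given column name.",
    "lit": "Creates a Column of literal value.",
}

def _create_function(name, doc=""):
    def _(arg):
        # Real PySpark delegates to the JVM function of the same name here;
        # this stub just records the call for illustration.
        return f"{name}({arg!r})"
    _.__name__ = name
    _.__doc__ = doc
    return _

for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

print(col("age"))  # noqa: F821 -- `col` exists at runtime, not in the source
```

This is also why IDEs and linters often flag these imports even though they work at runtime.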
-
How To Start Your Next Data Engineering Project
Apache Spark
-
Big Data Processing, EMR with Spark and Hadoop | Python, PySpark
Apache Spark is an open-source, distributed processing system used for big data workloads. Wanna dig deeper?
-
Rule
Spark Legend-er
Trino
Distributed SQL query engine for big data
-
Feasibility on startup idea related to data pipelines
For querying various databases, Trino is a distributed SQL query engine that could help - https://trino.io/
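For a taste of what that looks like from application code, here is a hedged sketch using the trino Python client (pip install trino); the host, catalog, schema, and table are hypothetical placeholders:

```python
# Hedged sketch: querying one of the attached databases through Trino.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator host
    port=8080,
    user="pipeline",
    catalog="postgresql",      # hypothetical catalog for one of the databases
    schema="public",
)
cur = conn.cursor()
cur.execute("SELECT customer_id, count(*) FROM orders GROUP BY customer_id")
for row in cur.fetchall():
    print(row)
```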
-
How Does The Data Lakehouse Enhance The Customer Data Stack?
Processing has also evolved since Hadoop. First came Spark, which offered a more user-friendly API for Map-Reduce, and then distributed query engines like Trino. These two processing frameworks co-exist most of the time, addressing different needs: Trino is mainly used for analytical online queries where latency matters, while Spark is heavily used for bigger workloads (think ETL) where the volume of data is much larger and latency is less important.
-
What Is Trino And Why Is It Great At Processing Big Data
Let's be clear: Trino is not a database. This is a common misconception. Just because you use Trino to run SQL against data doesn't mean it's a database.
-
Learn SQL
You might find https://trino.io/ interesting. It allows you to bolt on a MPP SQL execution engine on top of any data source including pre-built connectors for Druid and Kafka.
It's all ANSI SQL, and the best part is you can combine data from heterogeneous sources, e.g. you can join data between a topic in Kafka and a table in Druid, or even between Kafka, S3, and your RDBMS (see the sketch below).
Disclaimer: I'm a maintainer of the project.
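A hedged sketch of such a federated query, issued through the trino Python client; the catalog, schema, and column names are hypothetical placeholders:

```python
# Hedged sketch: joining a Kafka topic with a Postgres table through Trino.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="demo")
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, k.event_time
    FROM postgresql.public.orders AS o     -- RDBMS table
    JOIN kafka.default.order_events AS k   -- Kafka topic exposed as a table
      ON o.order_id = k.order_id
""")
print(cur.fetchmany(10))
```

Each source is addressed by its fully qualified catalog.schema.table name, which is what lets a single statement span backends.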
-
What even is data mesh
Not central to the main ideas of this article, but if you want to have a data mesh that is self-service, why force folks to use a particular storage medium like a data warehouse? That still requires centralization of the data.
Why not instead have a tool like Trino (https://trino.io) that lets different domains keep whatever datastore they already use? You would still need to enforce schema, but this can be done in tools like a schema registry, as mentioned in the article, along with a data cataloging tool.
These tools facilitate the distributed nature of the problem nicely and encourage healthy standards to be discussed and then formalized in schema definitions and catalogs that remove the ambiguity of discourse and documentation.
Nice example is laid out in this repo of how Trino can accomplish data mesh principles 1 and 3 (https://github.com/findinpath/trino_data_mesh).
-
What is Cost-based Optimization?
In Presto/Trino, the cost is a vector of estimated CPU, memory, and network usage. The vector is also converted into a scalar value during comparison.
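As an illustration of reducing that vector to a scalar, here is a weighted-sum sketch; the weights are hypothetical stand-ins for whatever the optimizer's comparator is actually configured with:

```python
# Illustration only: collapsing a (cpu, memory, network) cost vector into a
# scalar so two candidate plans can be compared. Weights are hypothetical.
from dataclasses import dataclass

@dataclass
class PlanCost:
    cpu: float      # estimated CPU cost
    memory: float   # estimated peak memory
    network: float  # estimated bytes shuffled

WEIGHTS = {"cpu": 75.0, "memory": 10.0, "network": 15.0}  # hypothetical

def to_scalar(cost: PlanCost) -> float:
    return (WEIGHTS["cpu"] * cost.cpu
            + WEIGHTS["memory"] * cost.memory
            + WEIGHTS["network"] * cost.network)

plan_a = PlanCost(cpu=1.0, memory=4.0, network=0.5)
plan_b = PlanCost(cpu=2.0, memory=1.0, network=0.2)
# The plan with the smaller scalar wins the comparison.
print(min((plan_a, plan_b), key=to_scalar))
```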
-
Looking for Feedback: Open Source SQL-in-Markdown Reporting tool
Love it! I'd like it to be able to talk to Trino. I'm not sure if there's a driver for node but I could help build it.
-
ClickHouse: An open-source column-oriented database management system
Take a look at query engines like Trino (formerly PrestoSQL) [https://trino.io/]. (Disclaimer: I'm a contributor to Trino).
I used it at a previous job to combine data from MongoDB, Kafka, S3 and Postgres to great effect. It tries to push-down as many operations as possible to the source too to improve performance.
Full ANSI SQL support over any number of backends (Kafka, Cassandra, Postgres, ClickHouse, S3, and many more).
The best part is it has a plugin ecosystem, so you can very easily implement your own connectors; all the heavy lifting gets done by the core engine while your plugin only has to translate your backend into concepts the engine can understand.
-
Why hasn't Presto become industry standard?
* Active-active HA is not really necessary IMO, as Trino is designed for low-latency interactive queries in general. It can handle longer-running batch queries, but it gives up fault tolerance in order to fail fast; you just resubmit the query. Predecessors like Hive, Spark, etc. handle ETL and long-running batch processes efficiently, but checkpointing that work adds complexity to the query. I could see the need for an active-passive HA to have on deck during a failure. Setting up your own active-passive HA is as simple as putting two coordinators behind a proxy and pointing your workers to the proxy address. Then you basically have the proxy run health checks and fail over in the event of an outage; a minimal sketch follows below. Here's the issue to track native HA though: https://github.com/trinodb/trino/issues/391.
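A minimal sketch of that active-passive setup, assuming nginx as the proxy; all hostnames are hypothetical placeholders:

```nginx
# nginx.conf (fragment): primary coordinator with a passive backup.
upstream trino_coordinator {
    server coordinator-1.internal:8080;
    server coordinator-2.internal:8080 backup;  # used only if the primary fails
}
server {
    listen 8080;
    location / {
        proxy_pass http://trino_coordinator;
    }
}
```

Workers then point at the proxy rather than a specific coordinator via discovery.uri:

```properties
# etc/config.properties on each worker (fragment)
coordinator=false
discovery.uri=http://trino-proxy.internal:8080
```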
What are some alternatives?
dremio-oss - Dremio - the missing link in modern data
Apache Drill - Apache Drill is a distributed MPP query layer for self-describing data
Apache Calcite - A dynamic data management framework
ClickHouse - ClickHouse® is a free analytics DBMS for big data
Presto - The official home of the Presto distributed SQL query engine for big data
Scalding - A Scala API for Cascading
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
Smile - Statistical Machine Intelligence & Learning Engine
Weka
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows