distributed-joins
Optimizing Distributed Joins: The Case of Google Cloud Spanner and DataStax Astra DB
This data model and these queries can be readily instantiated in any relational database, including Google Cloud Spanner (see this SQL script as an example), but doing so would not produce the join optimizations we are looking to implement. We show how to do much better in the next two sections.
Shuffle and broadcast joins are more suitable for batch or near real-time analytics. For example, they are the main join strategies in Apache Spark. Co-located and pre-computed joins are faster and can be used for online transaction processing in real-time applications. They frequently rely on organizing data according to storage schemes that are specific to a given database.
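To make the difference between the first two strategies concrete, here is a minimal in-memory sketch in Python. It is not how Apache Spark or any database implements these joins; the function names, the `num_nodes` parameter, and the row layout (lists of dictionaries) are assumptions made purely for illustration. A broadcast join copies the small table to every node and probes it locally, while a shuffle join first repartitions both tables by the join key so that matching rows land on the same node.

```python
from collections import defaultdict


def local_hash_join(build_rows, probe_rows, key):
    """Join two row lists that already sit on the same (hypothetical) node."""
    lookup = defaultdict(list)
    for row in build_rows:
        lookup[row[key]].append(row)
    # Probe the hash table once per row of the other side.
    return [{**b, **p} for p in probe_rows for b in lookup.get(p[key], [])]


def broadcast_join(small, large_partitions, key):
    """Broadcast join: every node receives a full copy of the small table."""
    result = []
    for partition in large_partitions:  # one partition per simulated node
        result.extend(local_hash_join(small, partition, key))
    return result


def shuffle_join(left, right, key, num_nodes=3):
    """Shuffle join: move rows of both tables to the node chosen by hash(key)."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for row in left:
        left_parts[hash(row[key]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[key]) % num_nodes].append(row)
    result = []
    for node in range(num_nodes):  # each node joins only its own rows
        result.extend(local_hash_join(left_parts[node], right_parts[node], key))
    return result
```

Both functions return the same set of joined rows; what differs is how much data has to move between nodes before the local join can run, which is why neither strategy is a natural fit for low-latency transactional reads.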
To try this example in DataStax Astra DB, we share our CQL script. If you are new to CQL, it stands for Cassandra Query Language and is used in both DataStax Astra DB and Apache Cassandra. Astra DB is a serverless, multi-region database service based on Apache Cassandra, an open-source NoSQL database. To learn more about CQL and tables with multi-row partitions, we highly recommend the hands-on Cassandra Fundamentals learning series. For more advanced data modeling, there is also a collection of data modeling examples from various domains.
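For readers who have not yet worked with multi-row partitions, the idea can be sketched in plain Python. This is a toy model, not the Cassandra storage engine or the CQL script mentioned above: the `PartitionedTable` class, its method names, and the sample user and order columns are all hypothetical. Rows that share a partition key are stored together and ordered by a clustering key, so a join result that has been written into one partition can be read back with a single lookup.

```python
class PartitionedTable:
    """Toy model of a table with multi-row partitions (illustrative only)."""

    def __init__(self):
        # partition key -> {clustering key -> remaining columns}
        self._partitions = {}

    def insert(self, partition_key, clustering_key, columns):
        # Writing the same primary key again overwrites the row,
        # mirroring the upsert behavior of a CQL INSERT.
        self._partitions.setdefault(partition_key, {})[clustering_key] = columns

    def read_partition(self, partition_key):
        # One lookup returns every row of the partition,
        # sorted by clustering key.
        rows = self._partitions.get(partition_key, {})
        return [{"clustering_key": ck, **rows[ck]} for ck in sorted(rows)]


# Hypothetical pre-computed join: each order row carries the user's name,
# and all orders of one user live in that user's partition.
orders_by_user = PartitionedTable()
orders_by_user.insert("user-1", 11, {"name": "Ann", "total": 25.0})
orders_by_user.insert("user-1", 10, {"name": "Ann", "total": 40.0})
orders_by_user.insert("user-2", 12, {"name": "Bob", "total": 15.0})

ann_orders = orders_by_user.read_partition("user-1")
```

The trade-off in this sketch is the usual one for pre-computed joins: the user's name is duplicated in every order row, in exchange for reads that never have to combine data from more than one partition.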