Reading from Cassandra in Spark does not return all the data when using JoinWithCassandraTable

spark-cassandra-connector

1 1,930 5.1 Scala

DataStax Connector for Apache Spark to Apache Cassandra (by datastax)

This works perfectly fine and I get all the data I'm expecting. However, if I lower spark.cassandra.sql.inClauseToJoinConversionThreshold (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md) to something like 20, I hit the threshold (my cross-product is 10*10=100) and JoinWithCassandraTable is used. I then suddenly do not get all the data, and on top of that I get duplicated rows for some of it. It looks like some partition keys are missing entirely while others return duplicated rows (this quick analysis might be wrong, however).
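One way to check that quick analysis is to collect the rows from both runs (for example as tuples of primary-key columns) and diff them as multisets. The following is a minimal sketch in plain Python, not connector code: the helper names `cross_product_size` and `diff_rows` and the sample rows are hypothetical, and the comparison with the threshold assumes the conversion happens when the cross-product exceeds the configured value.

```python
from collections import Counter


def cross_product_size(*in_value_sets):
    """Number of key combinations produced by several IN (...) lists."""
    size = 1
    for values in in_value_sets:
        size *= len(set(values))
    return size


def diff_rows(baseline_rows, join_rows):
    """Compare two result sets as multisets of hashable rows.

    Returns (missing, extra):
      missing: rows in the baseline that the join run lacks, with counts
      extra:   rows the join run returned more often than the baseline
    """
    baseline = Counter(baseline_rows)
    joined = Counter(join_rows)
    # Counter subtraction keeps only positive counts.
    return dict(baseline - joined), dict(joined - baseline)


# Two IN lists of 10 values each give the 10*10=100 combinations from the post.
combos = cross_product_size(range(10), range(10))
threshold = 20  # the lowered value from the post
uses_join = combos > threshold  # assumption: conversion when the size exceeds the threshold

# Hypothetical sample rows as (partition_key, clustering_key) tuples.
baseline = [("a", 1), ("a", 2), ("b", 1)]
with_join = [("a", 1), ("a", 1), ("a", 2)]
missing, extra = diff_rows(baseline, with_join)
```

Here `missing` lists rows that disappeared in the join run and `extra` lists rows that came back more often than in the baseline, which separates "missing partition keys" from "duplicated rows" instead of inferring both from row counts.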

This page summarizes the projects mentioned and recommended in the original post on /r/apachespark. Post date: 9 Mar 2022
