Question
When extracting a small number of partitions from a large Cassandra (C*) table using RDDs, we can use this:
val rdd = … // RDD containing the partition key data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
  .joinWithCassandraTable(keyspace, tableName)
Is there an equally effective approach available using DataFrames?
Update (Apr 26, 2017):
To be more concrete, I prepared an example.
I have 2 tables in Cassandra:
CREATE TABLE ids (
    id text,
    registered timestamp,
    PRIMARY KEY (id)
)

CREATE TABLE cpu_utils (
    id text,
    date text,
    time timestamp,
    cpu_util int,
    PRIMARY KEY ((id, date), time)
)
The first table contains a list of valid IDs and the second one CPU utilization data. For one day, say "2017-04-25", I would like to efficiently get the average CPU utilization for each id in table ids. Note that the partition key of cpu_utils is (id, date), so each (id, date) pair maps to exactly one Cassandra partition.
The most efficient way I know of with RDDs is the following:
val sc: SparkContext = ...
val date = "2017-04-25"

// one (id, date) pair per partition key of cpu_utils
val partitions = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))

val data = partitions.repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), (r.getDouble("cpu_util"), 1)))

// aggrData in form: (id, (avg(cpu_util), count))
// example row: ("718be4d5-11ad-4849-8aab-aa563c9c290e", (6, 723))
// the merge keeps a running average, weighting each side by its count
val aggrData = data.reduceByKey((a, b) => (
  1d * (a._1 * a._2 + b._1 * b._2) / (a._2 + b._2),
  a._2 + b._2))

aggrData.foreach(println)
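For reference, an equivalent formulation (a sketch I have not benchmarked; it uses only standard PairRDDFunctions) keeps a running (sum, count) pair and divides once at the end, which avoids the weighted-average merge:

// equivalent aggregation: accumulate (sum, count) per id, divide at the end
// (same `data` RDD as above)
val sums = data.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
val aggrData2 = sums.mapValues { case (sum, count) => (sum / count, count) }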
This approach takes about 5 seconds to complete (my setup: Spark on my local machine, Cassandra on a remote server). With it, I am touching less than 1% of the partitions in table cpu_utils.
With DataFrames, this is the approach I am currently using:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{avg, count, lit}

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val date = "2017-04-25"

// small DataFrame of (id, date) pairs built from table ids
val partitions = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "ids", "keyspace" -> keyspace)).load()
  .select($"id").withColumn("date", lit(date))

// cpu_utils table as a DataFrame
val data: DataFrame = sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "cpu_utils", "keyspace" -> keyspace)).load()
  .select($"id", $"cpu_util", $"date")

val dataFinal = partitions.join(data,
    partitions.col("id").equalTo(data.col("id")) and
    partitions.col("date").equalTo(data.col("date")))
  .select(data.col("id"), data.col("cpu_util"))
  .groupBy("id")
  .agg(avg("cpu_util"), count("cpu_util"))

dataFinal.show()
However, this approach seems to load the whole table cpu_utils into memory, as the execution time here is considerably longer (almost 1 minute).
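A quick way to verify this (just a diagnostic sketch; the exact plan text depends on the Spark and connector versions) is to print the query plan and check whether the scan of cpu_utils carries any pushed-down filter on the partition key columns:

// prints the logical and physical plans; if no (id, date) filters are
// pushed down to the Cassandra source, the plan shows a full table scan
dataFinal.explain(true)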
Is there a better approach using DataFrames that would at least match, if not outperform, the RDD approach mentioned above?
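The only fallback I see so far (a sketch, shown for completeness rather than as a real DataFrame-side solution; `keys` just rebuilds the `partitions` RDD from the RDD example, and the toDF column names are my own choice) is to keep the locality-aware join on the RDD API and switch to DataFrames only for the aggregation:

// hybrid sketch: RDD join for data locality, DataFrame aggregation after;
// assumes `import sqlContext.implicits._` so .toDF works on an RDD of tuples
val keys = sc.cassandraTable(keyspace, "ids")
  .select("id").map(r => (r.getString("id"), date))

val joinedDf = keys
  .repartitionByCassandraReplica(keyspace, "cpu_utils")
  .joinWithCassandraTable(keyspace, "cpu_utils")
  .select("id", "cpu_util").values
  .map(r => (r.getString("id"), r.getDouble("cpu_util")))
  .toDF("id", "cpu_util")

joinedDf.groupBy("id").agg(avg("cpu_util"), count("cpu_util")).show()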
P.S.: I am using Spark 1.6.1.
Source: https://stackoverflow.com/questions/43552506/is-there-an-alternative-to-joinwithcassandratable-for-dataframes-in-spark-scala