SELECT DISTINCT Cassandra in Spark

Submitted by 眉间皱痕 on 2019-12-11 05:03:04

Question


I need a query that lists the unique composite partition keys inside of Spark.
The query in Cassandra, SELECT DISTINCT key1, key2, key3 FROM schema.table;, is quite fast; however, running the same sort of filter through an RDD or spark.sql retrieves results incredibly slowly in comparison.

e.g.

---- SPARK ----
var t1 = sc.cassandraTable("schema","table").select("key1", "key2", "key3").distinct()
var t2 = spark.sql("SELECT DISTINCT key1, key2, key3 FROM schema.table")

t1.count // takes 20 minutes
t2.count // takes 20 minutes

---- CASSANDRA ----
// takes < 1 minute while also printing out all results
SELECT DISTINCT key1, key2, key3 FROM schema.table; 

where the table schema is:

CREATE TABLE schema.table (
    key1 text,
    key2 text,
    key3 text,
    ckey1 text,
    ckey2 text,
    v1 int,
    PRIMARY KEY ((key1, key2, key3), ckey1, ckey2)
);

Doesn't Spark use Cassandra optimisations in its queries?
How can I retrieve this information efficiently?


Answer 1:


Quick Answers

Doesn't Spark use Cassandra optimisations in its queries?

Yes, but with SparkSQL only column pruning and predicate pushdowns; with RDDs it is manual.

How can I retrieve this information efficiently?

Since your request returns quickly enough, I would just use the Java Driver directly to get this result set.


Long Answers

While Spark SQL can provide some C*-based optimizations, these are usually limited to column pruning and predicate pushdowns when using the DataFrame interface. This is because the framework only provides limited information to the datasource. We can see this by running explain on the query you have written.

Let's start with the SparkSQL example:

scala> spark.sql("SELECT DISTINCT key1, key2, key3 FROM test.tab").explain
== Physical Plan ==
*HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
+- Exchange hashpartitioning(key1#30, key2#31, key3#32, 200)
   +- *HashAggregate(keys=[key1#30, key2#31, key3#32], functions=[])
      +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation test.tab[key1#30,key2#31,key3#32] ReadSchema: struct<key1:string,key2:string,key3:string>

So your Spark example will actually be broken into several steps.

  1. Scan: Read all the data from this table. This means serializing every value from the C* machine to the Spark executor JVM, in other words lots of work.
  2. *HashAggregate/Exchange/HashAggregate: Take the values from each executor, hash them locally, then exchange the data between machines and hash again to ensure uniqueness. In layman's terms this means creating large hash structures, serializing them, running a complicated distributed sort-merge, then running a hash again. (Expensive)

Why doesn't any of this get pushed down to C*? This is because the datasource (the CassandraSourceRelation in this case) is not given the information about the Distinct part of the query. This is just part of how Spark currently works. See the docs on what is pushable.
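For contrast, a predicate on a partition key column does reach the datasource. A minimal sketch against the same test.tab table (the exact explain output varies by Spark and connector version):

scala> spark.sql("SELECT key1 FROM test.tab WHERE key1 = 'a'").explain
// The Scan node now reports something like
//   PushedFilters: [IsNotNull(key1), EqualTo(key1,a)]
// so the filtering happens inside Cassandra, while DISTINCT never
// reaches the CassandraSourceRelation at all.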

So what about the RDD version?

With RDDs we give a direct set of instructions to Spark. This means if you want to push something down, it must be specified manually. Let's look at the debug output of the RDD request:

scala> sc.cassandraTable("test","tab").distinct.toDebugString
res2: String =
(13) MapPartitionsRDD[7] at distinct at <console>:45 []
 |   ShuffledRDD[6] at distinct at <console>:45 []
 +-(13) MapPartitionsRDD[5] at distinct at <console>:45 []
    |   CassandraTableScanRDD[4] at RDD at CassandraRDD.scala:19 []

Here the issue is that your distinct call is a generic RDD operation and not specific to Cassandra. Since RDDs require all optimizations to be explicit (what you type is what you get), Cassandra never hears about the need for Distinct, and we get a plan that is almost identical to the Spark SQL version: do a full scan, serialize all of the data from Cassandra to Spark, shuffle, and then return the results.
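For comparison, this is what manual pushdown looks like with the connector's existing hooks: select prunes columns server-side and where appends a CQL predicate, so both run inside Cassandra. A sketch reusing the example table, with a made-up filter value:

import com.datastax.spark.connector._

// Both the column pruning and the predicate are executed by Cassandra,
// not by Spark, because they are stated explicitly on the RDD:
val pruned = sc.cassandraTable("test", "tab")
  .select("key1", "key2", "key3")
  .where("ckey1 = ?", "someValue")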

So what can we do about this?

With SparkSQL, this is about as good as we can get without adding new rules to Catalyst (the SparkSQL/DataFrames optimizer) to let it know that Cassandra can handle some distinct calls at the server level. It would then need to be implemented for the CassandraRDD subclasses.

For RDDs we would need to add a function like the already existing where, select, and limit calls on the Cassandra RDD. A new distinct call could be added here, although it would only be allowable in specific situations. This function does not currently exist in the SCC, but it could be added relatively easily since all it would do is prepend DISTINCT to requests, plus probably some checking to make sure the DISTINCT makes sense.

What can we do right now, without modifying the underlying connector?

Since we know the exact CQL request we would like to make, we can always use the Cassandra driver directly to get this information. The Spark Cassandra Connector provides a driver pool we can use, or we could just use the Java Driver natively. To use the pool, we would do something like:

import com.datastax.spark.connector.cql.CassandraConnector
CassandraConnector(sc.getConf).withSessionDo{ session => 
  session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab;").all()
}

And then parallelize the results if they are needed for further Spark work. If we really wanted to distribute this, we would most likely need to add the function to the Spark Cassandra Connector, as I described above.
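For example, a minimal sketch of that approach (assuming the same test.tab table and that the distinct keys fit in driver memory):

import com.datastax.spark.connector.cql.CassandraConnector
import scala.collection.JavaConverters._

// Run the fast CQL DISTINCT through the connector's session pool on the
// driver, then parallelize the keys for further Spark work:
val keys = CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT DISTINCT key1, key2, key3 FROM test.tab").all().asScala
    .map(row => (row.getString("key1"), row.getString("key2"), row.getString("key3")))
    .toVector
}
val keysRdd = sc.parallelize(keys)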




Answer 2:


As long as we are selecting the partition key, we can use the .perPartitionLimit function of the CassandraRDD:

val partition_keys = sc.cassandraTable("schema","table").select("key1", "key2", "key3").perPartitionLimit(1)

This works because, per SPARKC-436

select key from some_table per partition limit 1

gives the same result as

select distinct key from some_table

This feature was introduced in spark-cassandra-connector 2.0.0-RC1 and requires at least C* 3.6.
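From there, extracting plain key tuples should look something like this (a sketch reusing the column names from the question):

// Each returned row represents exactly one partition; pull out the keys:
val keyTuples = partition_keys.map(row =>
  (row.getString("key1"), row.getString("key2"), row.getString("key3")))
keyTuples.count // the number of distinct partition keys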




Answer 3:


Distinct has poor performance. Here is a good answer with some alternatives: How to efficiently select distinct rows on an RDD based on a subset of its columns

You can make use of toDebugString to get an idea of how much data your code shuffles.
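For example (rebuilding the RDD from the question):

// ShuffledRDD entries in the lineage mark where data crosses the network:
println(sc.cassandraTable("schema", "table")
  .select("key1", "key2", "key3")
  .distinct()
  .toDebugString)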



Source: https://stackoverflow.com/questions/50055448/select-distinct-cassandra-in-spark
