I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB RAM) for testing purposes and edited spark-defaults.conf as follows:
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general count is expressed as

SELECT 1 FROM table

followed by Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
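For reference, the plain Dataset count in this setup would look more or less like the sketch below; the keyspace and table names are placeholders, and spark is the usual shell SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Cassandra-backed DataFrame; count() here fetches a value per row
// from Cassandra and sums them on the Spark side.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test_ks", "table" -> "test_table"))
  .load()

val n = df.count()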
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
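A minimal sketch of that RDD path, assuming the same placeholder keyspace and table and the standard spark-cassandra-connector imports:

import com.datastax.spark.connector._

val sc = spark.sparkContext

// cassandraCount pushes the counting down to Cassandra, so no per-row
// data is shipped to Spark just to be summed locally.
val cnt = sc.cassandraTable("test_ks", "test_table").cassandraCount()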
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to (...), why is it creating (...) tasks?
Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always use 1 partition for the final aggregation).
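To illustrate (df is any Cassandra-backed DataFrame like the one above, and the key column is just a placeholder):

// Affects only shuffle stages, e.g. keyed aggregations:
spark.conf.set("spark.sql.shuffle.partitions", "200")

// groupBy triggers a shuffle, so its aggregation stage gets 200 partitions.
df.groupBy("key").count()

// A global count still collapses to a single partition for the final
// aggregation, regardless of spark.sql.shuffle.partitions.
df.count()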
If you are interested in controlling the number of initial partitions you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
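If you still want to experiment with the input split size, one way is to set it on the SparkConf when building the session; the host and the split size below are only for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  // Smaller splits mean more (smaller) input partitions when scanning the table.
  .set("spark.cassandra.input.split.size_in_mb", "32")

val spark = SparkSession.builder().config(conf).getOrCreate()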