I have a 3-node Cassandra cluster with 1 seed node, plus 1 Spark master and 3 slave nodes, each with 8 GB of RAM and 2 cores. Here is the input setting for my Spark jobs:
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration, I see that around 768 partitions are created for about 89.1 MB of data (roughly 1,706,765 records). I do not understand why so many partitions are created. I am using version 1.4 of the Spark Cassandra connector, so the input split size bug should already be fixed.
There are only 11 unique partition keys. My partition key consists of appname, which is always test, and a random number that is always in the range 0-10, so there are only 11 distinct partitions.
Why are so many partitions created, and how does Spark decide how many partitions to create?
The Cassandra connector does not use defaultParallelism. It checks a system table in C* (system.size_estimates, available from 2.1.5 onward) for an estimate of how many MB of data are in the given table. This estimate is divided by the input split size to determine the number of splits to make.
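To make the arithmetic concrete, here is a minimal Python sketch of that division. It is only a simplified model of the calculation described above, not the connector's actual code: the function name `estimate_split_count` and the rounding behaviour are assumptions for illustration, and the real connector works on token ranges, so the partition count you observe can differ.

```python
import math

BYTES_PER_MB = 1024 * 1024


def estimate_split_count(estimated_table_bytes: int, split_size_mb: int) -> int:
    """Simplified model: estimated table size divided by the split size.

    Hypothetical helper for illustration only; it is not part of the
    Spark Cassandra connector API.
    """
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    split_size_bytes = split_size_mb * BYTES_PER_MB
    # Round up so a partially filled split still counts, and never go below 1.
    return max(1, math.ceil(estimated_table_bytes / split_size_bytes))


# About 89.1 MB of data, as in the question.
table_bytes = int(89.1 * BYTES_PER_MB)

splits_at_64_mb = estimate_split_count(table_bytes, 64)
splits_at_1_mb = estimate_split_count(table_bytes, 1)
print(splits_at_64_mb, splits_at_1_mb)
```

Under this model, a smaller split size yields more splits for the same estimated table size, and a larger one yields fewer.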
If you are on C* < 2.1.5 you will need to manually set the partitioning via a ReadConf.
Source: https://stackoverflow.com/questions/32799122/spark-creating-too-many-partitions