Setting number of Spark tasks on a Cassandra table scan
I have a simple Spark job that reads 500M rows from a 5-node Cassandra cluster, and the scan always runs as 6 tasks, which is causing write issues because each task is so large.

I have tried adjusting input_split_size, which seems to have no effect. At the moment I am forced to repartition the table scan, which is not ideal because it's expensive. Having read a few posts, I also tried increasing num-executors in my launch script (below), but that had no effect either.

If there is no way to set the number of tasks on a Cassandra table scan, that's fine, I'll make do; but I have this constant niggling feeling that I'm missing something.
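For reference, this is roughly what I'm doing. It's a minimal sketch assuming the DataStax spark-cassandra-connector; the keyspace and table names are placeholders, and the exact split-size property name depends on the connector version (spark.cassandra.input.split.size in older releases, spark.cassandra.input.split.size_in_mb in newer ones), so treat the setting shown as what I attempted rather than a confirmed fix:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-scan")
  // Smaller splits should, in principle, produce more tasks per scan,
  // but in my case this setting appears to make no difference.
  .set("spark.cassandra.input.split.size_in_mb", "64")

val sc = new SparkContext(conf)

// Placeholder keyspace/table names.
val rdd = sc.cassandraTable("my_ks", "my_table")
println(rdd.partitions.length) // still reports 6 partitions for me

// Current workaround: an explicit repartition before writing out,
// which forces a full shuffle over ~500M rows and is expensive.
val repartitioned = rdd.repartition(48)
```

The repartition at the end is the workaround mentioned above; it spreads the load across more tasks, but only at the cost of shuffling the entire scan.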