Setting number of Spark tasks on a Cassandra table scan


Question


I have a simple Spark job reading 500m rows from a 5-node Cassandra cluster that always runs in 6 tasks, which causes write issues due to the size of each task. I have tried adjusting the input split size, which seems to have no effect. At the moment I am forced to repartition the table scan, which is not ideal as it's expensive.

Having read a few posts, I tried increasing num-executors in my launch script (below), although this had no effect.

If there is no way to set the number of tasks on a Cassandra table scan then that's fine, I'll make do, but I have a constant niggling feeling that I'm missing something here.

Spark workers live on the C* nodes, which are 8-core, 64 GB servers with a 2 TB SSD in each.

...
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", cassandraHost)
  .setAppName("rowMigration")
  conf.set("spark.shuffle.memoryFraction", "0.4")
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.set("spark.executor.memory", "15G")
  conf.set("spark.cassandra.input.split.size_in_mb", "32") //default 64mb
  conf.set("spark.cassandra.output.batch.size.bytes", "1000") //default
  conf.set("spark.cassandra.output.concurrent.writes", "5") //default

val sc = new SparkContext(conf)

val rawEvents = sc.cassandraTable(cassandraKeyspace, eventTable)
  .select("accountid", "userid", "eventname", "eventid", "eventproperties")
  .filter(row=>row.getString("accountid").equals("someAccount"))
  .repartition(100)

// "object" is a reserved word in Scala, so the mapped RDD is named migratedEvents here
val migratedEvents = rawEvents
  .map(ele => (ele.getString("userid"),
    UUID.randomUUID(),
    UUID.randomUUID(),
    ele.getUUID("eventid"),
    ele.getString("eventname"),
    "event type",
    UUIDs.unixTimestamp(ele.getUUID("eventid")),
    ele.getMap[String, String]("eventproperties"),
    Map[String, String](),
    Map[String, String](),
    Map[String, String]()))
  .map(row => MyObject(row))

migratedEvents.saveToCassandra(targetCassandraKeyspace, eventTable)

launch script:

#!/bin/bash
export SHADED_JAR="Migrate.jar"
export SPARKHOME="${SPARKHOME:-/opt/spark}"
export SPARK_CLASSPATH="$SHADED_JAR:$SPARK_CLASSPATH"
export CLASS=com.migration.migrate
"${SPARKHOME}/bin/spark-submit" \
        --class "${CLASS}" \
        --jars $SHADED_JAR,$SHADED_JAR \
        --master spark://cas-1-5:7077  \
        --num-executors 15 \
        --executor-memory 20g \
        --executor-cores 4 "$SHADED_JAR" \
        --worker-cores 20 \
        -Dcassandra.connection.host=10.1.20.201 \
        -Dzookeeper.host=10.1.20.211:2181

EDIT - Following Piotr's answer:

I have set ReadConf.splitCount on sc.cassandraTable as follows, but this does not change the number of tasks generated, meaning I still need to repartition the table scan. I'm starting to think that I'm approaching this wrong and that the repartition is a necessity. Currently the job takes about 1.5 hours, and repartitioning the table scan into 1000 tasks of roughly 10 MB each has reduced the write time to minutes.

val cassReadConfig = new ReadConf {
  ReadConf.apply(splitCount = Option(1000))
}

val sc = new SparkContext(conf)

val rawEvents = sc.cassandraTable(cassandraKeyspace, eventTable)
  .withReadConf(readConf = cassReadConfig)
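For what it's worth, the anonymous-subclass form above looks like it builds a ReadConf with all defaults and discards the result of ReadConf.apply, which would explain why the split count never reaches the scan. A minimal sketch of a construction that does carry splitCount through (assuming spark-cassandra-connector 1.3+, where ReadConf is a plain case class):

import com.datastax.spark.connector.rdd.ReadConf

// Sketch only: named arguments fill the remaining ReadConf defaults.
val splitReadConf = ReadConf(splitCount = Some(1000))

val rawEvents = sc.cassandraTable(cassandraKeyspace, eventTable)
  .withReadConf(splitReadConf)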

Answer 1:


Since Spark Cassandra Connector 1.3, split sizes are estimated from the system.size_estimates Cassandra table, available since Cassandra 2.1.5. Cassandra refreshes this table periodically, so shortly after loading or removing data, or after new nodes join, its contents may be inaccurate. Check whether the estimates there reflect your data volume. This is a relatively new feature, so it is also quite possible there are some bugs in it.
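A quick way to check what those estimates currently look like, from the same job, is to query system.size_estimates through the connector's session (a sketch; conf, cassandraKeyspace and eventTable are the values from the question):

import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// Sketch: print the per-token-range estimates the connector uses for split planning.
CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    "SELECT range_start, range_end, mean_partition_size, partitions_count " +
    "FROM system.size_estimates WHERE keyspace_name = ? AND table_name = ?",
    cassandraKeyspace, eventTable
  ).all().asScala.foreach(println)
}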

If the estimates are wrong, or you're running an older Cassandra, there is still a way to override the automatic split size tuning: sc.cassandraTable takes a ReadConf parameter in which you can set splitCount, which forces a fixed number of splits.

As for the split.size_in_mb parameter, there was indeed a bug in the project source for some time, but it was fixed before any version was published to Maven. So unless you're compiling the connector from (old) source, you shouldn't hit it.




Answer 2:


There seems to be a bug with the split.size_in_mb parameter: the code may be interpreting it as bytes instead of megabytes, so try changing the 32 to something much bigger. See an example in the answers here.
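If that is the case, a workaround sketch (purely illustrative, only relevant on connector versions affected by the bug) is to write the intended 32 MB out in bytes:

// Workaround sketch: if the value is read as bytes, spell out 32 MB explicitly.
conf.set("spark.cassandra.input.split.size_in_mb", (32L * 1024 * 1024).toString) // 33554432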



Source: https://stackoverflow.com/questions/31672860/setting-number-of-spark-tasks-on-a-cassandra-table-scan
