Question
I am using the Spark Cassandra connector. It takes 5-6 minutes to fetch data from a Cassandra table. In the Spark logs I see many tasks and executors; the reason might be that Spark divided the processing into many tasks!
Below is my code example:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {
    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Map each row of the dev.demo table to a Demo_Bean
    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev", "demo",
            mapRowTo(Demo_Bean.class));
    System.out.println("Row Count: " + empRDD.count());
}
Answer 1:
After searching on Google, I found the issue is in the latest spark-cassandra-connector.
The parameter spark.cassandra.input.split.size_in_mb has a default value of 64 MB, but that value is being interpreted as 64 bytes in the code.
So try passing the size in bytes instead:
spark.cassandra.input.split.size_in_mb = 64 * 1024 * 1024 = 67108864
Here is an example:
public static void main(String[] args) {
    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1")
            // Workaround: the MB setting is read as bytes, so pass 64 MB expressed in bytes
            .set("spark.cassandra.input.split.size_in_mb", "67108864");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev", "demo",
            mapRowTo(Demo_Bean.class));
    System.out.println("Row Count: " + empRDD.count());
}
Answer 2:
To speed it up, you might try setting spark.cassandra.input.split.size_in_mb when you create the SparkConf.
It could be that the executors are trying to read all of the rows into memory at once. If they don't all fit, the RDD may get paged to disk, resulting in the slow count. By specifying a split size, the rows are counted in chunks and then discarded rather than paged to disk.
You can see an example of how to set the split size here.
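As a minimal sketch of that suggestion (the class name SplitSizeCheck and the app name are placeholders; it reuses the dev.demo table from the question and the byte-style workaround value from Answer 1, which would instead be a plain megabyte value such as "64" on connector versions where the unit bug is fixed), the split size is set on the SparkConf before the context is built, and printing the partition count shows how finely the scan was split:
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitSizeCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf(true).setMaster("local[4]")
                .setAppName("Split_Size_Check")
                .set("spark.cassandra.connection.host", "127.0.0.1")
                // Value from Answer 1's workaround (64 MB in bytes); use plain MB on fixed versions
                .set("spark.cassandra.input.split.size_in_mb", "67108864");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Raw-row read of the same table; no bean mapping is needed just to count
        JavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable("dev", "demo");
        // Each partition becomes one Spark task
        System.out.println("Partitions: " + rdd.partitions().size());
        System.out.println("Row Count: " + rdd.count());
        sc.stop();
    }
}
Each partition corresponds to one Spark task, so if the printed partition count is still very large, the split-size setting is effectively being ignored or misread, which points back to the interpretation bug described in Answer 1.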
Source: https://stackoverflow.com/questions/31583249/apache-spark-taking-5-to-6-minutes-for-simple-count-of-1-billon-rows-from-cassan