Apache Spark taking 5 to 6 minutes for simple count of 1 billion rows from Cassandra

Submitted by 泪湿孤枕 on 2019-12-17 19:04:28

Question


I am using the Spark Cassandra connector. It takes 5-6 minutes to fetch data from the Cassandra table. In the Spark logs I see many tasks and executors. The reason might be that Spark divides the process into many tasks!

Below is my code example:

// requires: import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
public static void main(String[] args) {

    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1");

    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev",
            "demo");
    System.out.println("Row Count: " + empRDD.count());
}

Answer 1:


After searching on Google I found the issue is in the latest spark-cassandra-connector. The parameter spark.cassandra.input.split.size_in_mb has a default value of 64 MB, which is being interpreted as 64 bytes in the code. So try spark.cassandra.input.split.size_in_mb = 64 * 1024 * 1024 = 67108864.

Here is an example:

public static void main(String[] args) {

    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1")
            .set("spark.cassandra.input.split.size_in_mb", "67108864");

    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc).cassandraTable("dev",
            "demo");
    System.out.println("Row Count: " + empRDD.count());
}



Answer 2:


To speed it up, you might try setting spark.cassandra.input.split.size_in_mb when you create the SparkConf.

It could be that the executors are trying to read all the rows into memory at once. If they don't all fit, Spark may spill the RDD to disk, resulting in the slow count. By specifying a split size, the executors can count the rows in chunks and then discard them rather than spilling to disk.

You can see an example of how to set the split size in the answer above.
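Concretely, setting the split size when building the SparkConf might look like the sketch below. It assumes a local Cassandra node at 127.0.0.1 and the connector version affected by the bytes-vs-MB bug described in Answer 1, which is why the value is 64 MB expressed in bytes:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitSizeExample {
    public static void main(String[] args) {
        // Workaround for the connector bug noted above: the setting is
        // interpreted as bytes, so pass 64 MB written out in bytes.
        long splitSizeBytes = 64L * 1024 * 1024; // 67108864

        SparkConf conf = new SparkConf(true)
                .setMaster("local[4]")
                .setAppName("Split_Size_Example")
                .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
                .set("spark.cassandra.input.split.size_in_mb",
                        String.valueOf(splitSizeBytes));

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the cassandraTable(...).count() from the question here ...
        sc.stop();
    }
}
```

On a fixed connector version the plain value "64" should behave as intended; the inflated value is only a workaround for the buggy interpretation.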



Source: https://stackoverflow.com/questions/31583249/apache-spark-taking-5-to-6-minutes-for-simple-count-of-1-billon-rows-from-cassan
