Apache Spark taking 5 to 6 minutes for a simple count of 1 billion rows from Cassandra

Submitted by 岁酱吖の on 2019-11-28 09:31:37

After searching on Google, I found the issue is in the latest spark-cassandra-connector: the parameter spark.cassandra.input.split.size_in_mb has a default value of 64 MB, but it is being interpreted as 64 bytes in the code. So try setting spark.cassandra.input.split.size_in_mb = 64 * 1024 * 1024 = 67108864.

Here is an example:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(String[] args) {

    // Work around the connector bug by passing the split size in bytes.
    SparkConf conf = new SparkConf(true).setMaster("local[4]")
            .setAppName("App_Name")
            .set("spark.cassandra.connection.host", "127.0.0.1")
            .set("spark.cassandra.input.split.size_in_mb", "67108864");

    JavaSparkContext sc = new JavaSparkContext(conf);

    // Map each Cassandra row to the Demo_Bean POJO before counting.
    JavaRDD<Demo_Bean> empRDD = javaFunctions(sc)
            .cassandraTable("dev", "demo", mapRowTo(Demo_Bean.class));

    System.out.println("Row Count: " + empRDD.count());
}
Jim Meyer

To speed it up, you might try setting the spark.cassandra.input.split.size_in_mb property when you create the SparkConf.

It could be that the executors are trying to read all the rows into memory at once. If they don't all fit, that might cause Spark to page the RDD to disk, resulting in the slow count. By specifying a split size, Spark would count the rows in chunks and then discard them rather than paging to disk.

You can see an example of how to set the split size here.
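The linked example is not reproduced on this page, but a minimal sketch of setting the split size when building the SparkConf might look like the following. The keyspace and table names (my_keyspace, my_table), the class name SplitSizeExample, and the value "64" are placeholders for illustration, assuming a connector version where the setting is interpreted in megabytes as documented:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitSizeExample {
    public static void main(String[] args) {
        // Set the split size on the SparkConf so each Spark partition
        // reads a bounded chunk of the Cassandra table.
        SparkConf conf = new SparkConf(true)
                .setMaster("local[4]")
                .setAppName("Split_Size_Example")
                .set("spark.cassandra.connection.host", "127.0.0.1")
                // Assumed here: a connector version without the bug, where
                // the value is a plain megabyte count.
                .set("spark.cassandra.input.split.size_in_mb", "64");

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Count rows chunk by chunk instead of materializing them all at once.
        long count = javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table")  // placeholder names
                .count();
        System.out.println("Row count: " + count);

        sc.stop();
    }
}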
