Setting number of Spark tasks on a Cassandra table scan


Since Spark Cassandra Connector 1.3, split sizes are estimated from the system.size_estimates Cassandra table, available since Cassandra 2.1.5. Cassandra refreshes this table periodically, so shortly after loading or removing data, or after new nodes join the cluster, its contents may be inaccurate. Check whether the estimates there actually reflect your data volume. This is a relatively new feature, so it is also quite possible there are still some bugs in it.
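If you want to inspect those estimates directly, a minimal sketch using the connector's own session is below (the keyspace and table names are placeholders; sc is your active SparkContext):

```scala
import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// Query Cassandra's own per-token-range size estimates for the table you scan.
// 'my_keyspace' and 'my_table' are placeholder names.
CassandraConnector(sc.getConf).withSessionDo { session =>
  val rows = session.execute(
    "SELECT range_start, range_end, partitions_count, mean_partition_size " +
      "FROM system.size_estimates " +
      "WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table'")
  rows.all().asScala.foreach(println)
}
```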

If the estimates are wrong, or you're running an older Cassandra, we left the ability to override the automatic split size tuning: sc.cassandraTable takes a ReadConf parameter in which you can set splitCount, which forces a fixed number of splits.
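For example, a minimal sketch (the keyspace, table name, and split count are placeholders; ReadConf's other fields are left at their defaults):

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Force the scan of my_keyspace.my_table into a fixed number of splits,
// bypassing the size_estimates-based tuning.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(splitCount = Some(200)))

// The number of RDD partitions should now be roughly the requested split count.
println(rdd.partitions.length)
```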

As for the split_size_in_mb parameter, there was indeed a bug in the project source for some time, but it was fixed before any version was published to Maven. So unless you're compiling the connector from (old) source, you shouldn't hit it.

Jim Meyer

There seems to be a bug with the spark.cassandra.input.split.size_in_mb parameter. The code may be interpreting it as bytes instead of megabytes, so try changing the 32 to something much larger. See an example in the answers here.
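A sketch of that workaround, assuming the parameter is being read as bytes (the value 67108864, i.e. 64 * 1024 * 1024, and the keyspace/table names are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// If the connector misreads the setting as bytes, pass a value that is still
// sensible when interpreted that way; tune it for your data volume.
val conf = new SparkConf()
  .setAppName("cassandra-table-scan")
  .set("spark.cassandra.input.split.size_in_mb", "67108864")

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("my_keyspace", "my_table") // placeholder keyspace/table
```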
