Why does the Mongo Spark connector return different and incorrect counts for a query?

落花浮王杯 posted on 2019-12-01 04:32:48

I solved my issue. The inconsistent counts were caused by MongoDefaultPartitioner, which wraps MongoSamplePartitioner, which in turn uses random sampling. To be honest, this seems a rather odd default to me; I would personally prefer a slow but consistent partitioner. Details on the partitioner options can be found in the official configuration options documentation.

code:

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/enron_mail.messages")
  // The value is the partitioner's class name (short name for the built-in ones), not a config key
  .option("partitioner", "MongoPaginateBySizePartitioner")
  .load()
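
The same partitioner can also be set once on the session instead of per read. The following is only a minimal sketch, assuming the connector 2.x configuration keys `spark.mongodb.input.uri` and `spark.mongodb.input.partitioner`, a MongoDB instance on localhost, and a hypothetical application name. It repeats the count a few times as a quick check that the result is stable.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Mongo Spark connector 2.x is on the classpath and MongoDB runs locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partitioner-check") // hypothetical name, used only for this sketch
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/enron_mail.messages")
  .config("spark.mongodb.input.partitioner", "MongoPaginateBySizePartitioner")
  .getOrCreate()

// The read picks up the URI and partitioner from the session configuration.
val df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

// Count several times; a deterministic partitioner should yield a single distinct value.
val counts = (1 to 5).map(_ => df.count())
println(s"counts: $counts, consistent: ${counts.distinct.size == 1}")
```

If the counts still differ between runs, the partitioner setting is probably not being picked up, or the collection is being modified while the job reads it.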

This issue was mostly due to bug SPARK-151 in version 2.2.0 of the Mongo connector. It is resolved in version 2.2.1, which I have confirmed. You can continue to use the default partitioner with 2.2.1.
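
For reference, moving to the fixed release in an sbt project might look like the line below. This is a sketch that assumes an sbt build on Scala 2.11; adapt it to whatever build tool and Scala version the project actually uses.

```scala
// build.sbt: assumes a Scala 2.11 build; %% appends the Scala binary version to the artifact name
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.2.1"
```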
