Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?

佛祖请我去吃肉 2020-12-30 14:59

I have an RDD that is too large to consistently perform a `distinct` statement without spurious errors (e.g. SparkException: stage failed 4 times, ExecutorLostFailure, HDFS Fil…).

2 Answers
  •  谎友^ (OP)
     2020-12-30 15:40

    It might be better to figure out whether there is another underlying issue, but the approach below will do what you want. It is a rather roundabout way to do it, but it sounds like it will fit the bill:

    myRDD.map(a => (a._2._1._2, a._2._1._2))     // key each record by the value you want deduplicated
      .aggregateByKey(Set[YourType]())(
        (agg, value) => agg + value,             // within a partition: add the value to the Set
        (agg1, agg2) => agg1 ++ agg2)            // across partitions: union the Sets
      .keys                                      // one key per distinct value after the shuffle
      .count
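
    As a quick sanity check (not part of the original answer), here is the same pattern on a toy RDD of Ints, assuming a SparkContext named `sc` is already available:

    val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
    data.map(x => (x, x))                        // key each value by itself
      .aggregateByKey(Set[Int]())(_ + _, _ ++ _) // per-partition Sets, then union across partitions
      .keys
      .count                                     // returns 3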
    

    Or even this seems to work, but it isn't associative and commutative. It works because of how the internals of Spark work, but I might be missing a case, so while it is simpler, I'm not sure I trust it:

    myRDD.map(a => (a._2._1._2, a._2._1._2))
      .aggregateByKey(YourTypeDefault)((x, y) => y, (x, y) => x)  // keeps an arbitrary value per key
      .keys.count
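
    Since the question title asks about mapPartitions specifically, here is a minimal sketch (not from the answer above) of deduplicating inside each partition first so the final shuffle moves far less data; `myRDD` and the a._2._1._2 projection are borrowed from the code above, everything else is illustrative:

    myRDD.map(a => a._2._1._2)
      .mapPartitions(iter => iter.toSet.iterator)  // drop duplicates within each partition
      .map(x => (x, 1))
      .reduceByKey((a, _) => a)                    // associative and commutative; one record per key survives
      .count                                       // number of distinct values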
    
