Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?

佛祖请我去吃肉 2020-12-30 14:59

I have an RDD that is too large to consistently perform a `distinct` statement without spurious errors (e.g. SparkException: stage failed 4 times, ExecutorLostFailure, HDFS Fil…).

2 Answers
  •  谎友^ (OP)
     2020-12-30 15:40

    It might be better to figure out whether there is another underlying issue, but the approach below will do what you want. It is a rather roundabout way to do it, but it sounds like it will fit the bill:

    myRDD.map(a => (a._2._1._2, a._2._1._2))     // key each record by the value you want deduplicated
      .aggregateByKey(Set[YourType]())(
        (agg, value) => agg + value,             // within a partition: add the value to the Set
        (agg1, agg2) => agg1 ++ agg2)            // across partitions: union the Sets
      .keys                                      // one key per distinct value after the shuffle
      .count
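
    As a quick sanity check (not part of the original answer), here is the same pattern on a toy RDD of Ints, assuming a SparkContext named `sc` is already available:

    val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
    data.map(x => (x, x))                        // key each value by itself
      .aggregateByKey(Set[Int]())(_ + _, _ ++ _) // per-partition Sets, then union across partitions
      .keys
      .count                                     // returns 3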
    

    Or even this seems to work, but it isn't associative and commutative. It works because of how the internals of Spark work, but I might be missing a case, so while it is simpler, I'm not sure I trust it:

    myRDD.map(a => (a._2._1._2, a._2._1._2))
      .aggregateByKey(YourTypeDefault)((x, y) => y, (x, y) => x)  // keeps an arbitrary value per key
      .keys.count
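
    Since the question title asks about mapPartitions specifically, here is a minimal sketch (not from the answer above) of deduplicating inside each partition first so the final shuffle moves far less data; `myRDD` and the a._2._1._2 projection are borrowed from the code above, everything else is illustrative:

    myRDD.map(a => a._2._1._2)
      .mapPartitions(iter => iter.toSet.iterator)  // drop duplicates within each partition
      .map(x => (x, 1))
      .reduceByKey((a, _) => a)                    // associative and commutative; one record per key survives
      .count                                       // number of distinct values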
    
