I have an RDD that is too large to consistently run a distinct statement on without spurious errors (e.g. SparkException: stage failed 4 times, ExecutorLostFailure, HDFS Fil
It might be better to figure out whether there is another underlying issue, but the below will do what you want. It's a rather roundabout way to do it, but it sounds like it will fit the bill:
myRDD.map(a => (a._2._1._2, a._2._1._2)) // key and value are both the field you want distinct values of
  .aggregateByKey(Set[YourType]())((agg, value) => agg + value, (agg1, agg2) => agg1 ++ agg2)
  .keys
  .count
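To show the pattern end to end, here is a minimal self-contained sketch. The toy RDD, the String field, and the local SparkSession setup are placeholders standing in for your own job and for whatever a._2._1._2 actually is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("distinct-via-aggregateByKey")
  .master("local[*]") // only for a local test run
  .getOrCreate()
val sc = spark.sparkContext

// toy stand-in for your nested RDD: the third element plays the role of a._2._1._2
val rows = sc.parallelize(Seq(("x", 1, "a"), ("y", 2, "b"), ("z", 3, "a")))

val distinctCount = rows
  .map(r => (r._3, r._3))                       // key and value are the field to deduplicate
  .aggregateByKey(Set[String]())(_ + _, _ ++ _) // collect values per key into a Set
  .keys
  .count()                                      // number of distinct keys

println(distinctCount) // 2 in this toy example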
Or even this seems to work, although the combine functions aren't associative and commutative. It only works because of how Spark's internals happen to behave, and I might be missing a case, so while it's simpler, I'm not sure I trust it:
myRDD.map(a => (a._2._1._2, a._2._1._2))
  .aggregateByKey(YourTypeDefault)((x, y) => y, (x, y) => x) // just keeps one value per key
  .keys.count
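For completeness, the same toy data run through this simpler variant, reusing the rows RDD from the sketch above. An empty String stands in for YourTypeDefault, and the combine functions simply keep one of the (identical) values per key:

// "" stands in for YourTypeDefault
val simplerCount = rows
  .map(r => (r._3, r._3))
  .aggregateByKey("")((x, y) => y, (x, y) => x) // not associative/commutative in general, per the caveat above
  .keys
  .count()

println(simplerCount) // also 2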