How does Distinct() function work in Spark?

后悔当初 2020-12-02 15:43

I'm a newbie to Apache Spark and am learning the basic functionality. I have a small doubt. Suppose I have an RDD of tuples (key, value) and wanted to obtain some unique ones ou

5 answers
  •  清歌不尽
    2020-12-02 16:33

    Justin Pihony is right. distinct uses the hashCode and equals methods of the objects to decide which elements are duplicates. It returns the distinct elements (objects), so two tuples are collapsed only when both the key and the value match.

    val rdd = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
    

    Distinct

    rdd.distinct.collect().foreach(println)
    (2,22)
    (1,20)
    (3,22)
    (2,20)
    (1,21)
    (3,21)
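
    To make the role of equals and hashCode concrete, here is a minimal sketch that runs distinct on two element types. Point and RawPoint are hypothetical types introduced only for this illustration, and the local[2] master is an assumption for running it on one machine. Tuples behave like Point here, because Tuple2 also has structural equality.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical types, for illustration only.
    // A case class gets structural equals/hashCode generated by the compiler.
    case class Point(x: Int, y: Int)
    // A plain class inherits reference equality from java.lang.Object.
    class RawPoint(val x: Int, val y: Int) extends Serializable

    object DistinctEquality {
      // Returns (distinct count of Points, distinct count of RawPoints).
      def counts(sc: SparkContext): (Long, Long) = {
        val structural = sc.parallelize(Seq(Point(1, 20), Point(1, 20), Point(2, 20)))
        val byReference = sc.parallelize(Seq(new RawPoint(1, 20), new RawPoint(1, 20)))
        (structural.distinct().count(), byReference.distinct().count())
      }

      def main(args: Array[String]): Unit = {
        // Assumption: local mode with two threads, enough for a quick check.
        val conf = new SparkConf().setAppName("distinct-equality").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // The two equal Points collapse into one; the two RawPoints do not.
        try println(counts(sc)) // prints (2,2)
        finally sc.stop()
      }
    }
    ```

    If your own class should be deduplicated by content, make it a case class or override both equals and hashCode consistently.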
    

    If you want distinct by key only (one tuple per key, regardless of value), reduceByKey is the better option.

    reduceByKey

    val reduceRDD = rdd.map(tup => (tup._1, tup)).reduceByKey { case (a, b) => a }.map(_._2)
    
    reduceRDD.collect().foreach(println)
    

    Output:

    (2,20)
    (1,20)
    (3,21)
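
    One caveat with the snippet above: `{ case (a, b) => a }` keeps whichever value happens to be merged first, and as far as I know Spark does not guarantee that order across partitions. If you need a reproducible result, a possible variation is to reduce with a rule that does not depend on order, such as keeping the smallest value per key. This is only a sketch: MinPerKey is a hypothetical name, and the local[2] master and the final sortByKey (used only to make the printed order stable) are assumptions.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object MinPerKey {
      // Keeps one tuple per key, choosing the smallest value for that key.
      def minPerKey(sc: SparkContext): Array[(Int, Int)] = {
        val rdd = sc.parallelize(
          List((1, 20), (1, 21), (1, 20), (2, 20), (2, 22), (2, 20), (3, 21), (3, 22)))
        rdd
          .reduceByKey((a, b) => math.min(a, b)) // order-independent choice per key
          .sortByKey()                           // only to make the output order stable
          .collect()
      }

      def main(args: Array[String]): Unit = {
        // Assumption: local mode with two threads.
        val conf = new SparkConf().setAppName("min-per-key").setMaster("local[2]")
        val sc = new SparkContext(conf)
        // Prints (1,20), (2,20), (3,21), one per line.
        try minPerKey(sc).foreach(println)
        finally sc.stop()
      }
    }
    ```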
    
