How does Distinct() function work in Spark?

后悔当初 2020-12-02 15:43

I'm a newbie to Apache Spark and am learning its basic functionality. I have a small doubt. Suppose I have an RDD of tuples (key, value) and wanted to obtain some unique ones ou

5 Answers

清歌不尽 (original poster)

2020-12-02 16:33
Justin Pihony is right: distinct uses the hashCode and equals methods of the objects to decide which elements are duplicates. It returns the distinct elements (objects) of the RDD.
```
val rdd = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
```
Distinct
```
rdd.distinct.collect().foreach(println)
// Sample output (element order is not guaranteed):
// (2,22)
// (1,20)
// (3,22)
// (2,20)
// (1,21)
// (3,21)
```
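To make the hashCode/equals point concrete, here is a minimal sketch that uses plain Scala collections instead of Spark, an assumption made only so it runs without a cluster (List.distinct also deduplicates through hashCode and equals). The names Point, OpaquePoint and EqualityDemo are hypothetical and do not come from the answer above.

```scala
// Hypothetical types for illustration; neither appears in the original answer.
// A case class gets structural equals/hashCode generated by the compiler.
final case class Point(key: Int, value: Int)

// A plain class inherits reference-based equals/hashCode from AnyRef.
final class OpaquePoint(val key: Int, val value: Int)

object EqualityDemo {
  val points: List[Point] =
    List(Point(1, 20), Point(1, 21), Point(1, 20))

  val opaque: List[OpaquePoint] =
    List(new OpaquePoint(1, 20), new OpaquePoint(1, 21), new OpaquePoint(1, 20))

  // Structurally equal elements collapse into one.
  val distinctPoints: List[Point] = points.distinct

  // Every instance counts as unique because equality is by reference.
  val distinctOpaque: List[OpaquePoint] = opaque.distinct

  def main(args: Array[String]): Unit = {
    println(distinctPoints.size) // 2
    println(distinctOpaque.size) // 3
  }
}
```

The tuples in the answer's RDD behave like Point here, because Tuple2 is itself a case class. A custom class used as an RDD element would likely need consistent equals and hashCode overrides for distinct to treat equal-looking values as duplicates.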
If you want distinct elements by key only, reduceByKey is a better option.

reduceByKey
```
// Key each tuple by its first element, keep one tuple per key, then drop the helper key.
val reduceRDD = rdd
  .map(tup => (tup._1, tup))
  .reduceByKey { case (a, b) => a }
  .map(_._2)

reduceRDD.collect().foreach(println)
```
Output (one tuple per key; which duplicate survives and the print order may vary):
```
(2,20)
(1,20)
(3,21)
```
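Because a reduce function that simply returns its first argument keeps whichever duplicate happens to be merged first, the surviving tuple per key is not guaranteed. If a predictable result matters, one option is to reduce with a commutative and associative function such as math.min. The sketch below is only an illustration under stated assumptions: it expects Spark on the classpath, and the object name MinPerKeySketch, the helper minPerKey, the app name and the local[2] master are all hypothetical choices rather than part of the answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object and helper names, chosen for illustration only.
object MinPerKeySketch {
  // Returns one (key, value) pair per key, keeping the smallest value.
  def minPerKey(sc: SparkContext, data: Seq[(Int, Int)]): Array[(Int, Int)] =
    sc.parallelize(data)
      .reduceByKey((a, b) => math.min(a, b)) // commutative and associative
      .collect()
      .sorted // sort on the driver so the printed order is stable

  def main(args: Array[String]): Unit = {
    // Assumed local setup; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("min-per-key-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = List((1, 20), (1, 21), (1, 20), (2, 20), (2, 22), (2, 20), (3, 21), (3, 22))
      minPerKey(sc, data).foreach(println)
      // (1,20)
      // (2,20)
      // (3,21)
    } finally {
      sc.stop()
    }
  }
}
```

Sorting after collect is done here only to make the printed order repeatable; on a large RDD you would probably sort within Spark or skip the sort entirely.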