How does the distinct() function work in Spark?


后悔当初 2020-12-02 15:43

I'm a newbie to Apache Spark and am learning the basic functionality. I had a small doubt. Suppose I have an RDD of tuples (key, value) and wanted to obtain some unique ones ou

5 answers

庸人自扰 (original poster)

2020-12-02 16:36

The API docs for RDD.distinct() provide only a one-sentence description:

"Return a new RDD containing the distinct elements in this RDD."

From recent experience I can tell you that in an RDD of tuples, distinct() considers each tuple as a whole, not just its key or its value.
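To make the whole-tuple comparison concrete, here is a minimal sketch that uses a plain Python list and set as a stand-in for the RDD. This is an assumption for illustration only, not Spark itself; on a real PySpark RDD the corresponding call would be `rdd.distinct()`, and Spark does not guarantee the order of the result (the sort below is only there to make the output stable).

```python
# Plain-Python stand-in for an RDD of (key, value) tuples; this is not Spark.
pairs = [("k1", "v11"), ("k1", "v11"), ("k1", "v12"), ("k2", "v21")]

# Two tuples count as duplicates only when both key and value match,
# so ("k1", "v11") and ("k1", "v12") are both kept.
distinct_pairs = sorted(set(pairs))

print(distinct_pairs)
```

Only the exact repeat of ("k1", "v11") is dropped; the two tuples that share the key k1 but differ in value both survive.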

If you want distinct keys or distinct values, then depending on exactly what you want to accomplish, you can either:

A. call groupByKey() to transform {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} into {(k1,[v11,v12]), (k2,[v21,v22])}; or

B. strip out either the keys or the values by calling keys() or values(), followed by distinct().
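The two options can be sketched roughly as follows. This is again a plain-Python model of the semantics rather than Spark code, and the variable names are hypothetical; the comments note the PySpark calls each step is meant to mirror (`groupByKey()`, `keys()`, `values()`, `distinct()`).

```python
from collections import defaultdict

# Plain-Python stand-in for an RDD of (key, value) tuples; this is not Spark.
pairs = [("k1", "v11"), ("k1", "v12"), ("k2", "v21"), ("k2", "v22")]

# Option A: collect all values under each key.
# Mirrors rdd.groupByKey() in PySpark.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(grouped)

# Option B: drop one side of each tuple, then deduplicate.
# Mirrors rdd.keys().distinct() and rdd.values().distinct() in PySpark.
distinct_keys = sorted({key for key, _ in pairs})
distinct_values = sorted({value for _, value in pairs})

print(grouped)
print(distinct_keys)
print(distinct_values)
```

Option A keeps the values attached to each key, while option B discards the other half of every tuple, so which one fits depends on whether the values are still needed afterwards.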

As of this writing (June 2015), UC Berkeley and edX are running a free online course, "Introduction to Big Data and Apache Spark", which provides hands-on practice with these functions.
