How does Distinct() function work in Spark?

后悔当初 2020-12-02 15:43

I'm a newbie to Apache Spark and am learning the basic functionality. I have a small doubt. Suppose I have an RDD of tuples (key, value) and want to obtain some unique ones ou

5 answers
  •  庸人自扰
    2020-12-02 16:36

    The API docs for RDD.distinct() provide only a one-sentence description:

    "Return a new RDD containing the distinct elements in this RDD."

    From recent experience, I can tell you that in an RDD of tuples, distinct() compares each tuple as a whole: two elements count as duplicates only if both the key and the value are equal.
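
    To make the whole-tuple behaviour concrete, here is a minimal sketch in plain Python. It only models the semantics with a set, so it runs without Spark; the sample data and variable names are made up for illustration.

```python
# Plain-Python stand-in for an RDD of (key, value) tuples (sample data is made up).
pairs = [("k1", "a"), ("k1", "a"), ("k1", "b"), ("k2", "a")]

# Model of rdd.distinct(): an element is dropped only when the WHOLE tuple
# (key and value together) has already been seen.
# Sorting is just to get a stable order for printing.
distinct_pairs = sorted(set(pairs))

print(distinct_pairs)
# ("k1", "a") appears once; ("k1", "b") and ("k2", "a") both survive,
# even though they share a key or a value with another element.
```

    In PySpark the corresponding call would be something like sc.parallelize(pairs).distinct().collect(), where sc is assumed to be an existing SparkContext; the order of the collected result is not guaranteed.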

    If you want distinct keys or distinct values, then depending on exactly what you want to accomplish, you can either:

    A. call groupByKey() to transform {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} into {(k1,[v11,v12]), (k2,[v21,v22])}; or

    B. strip out either the keys or values by calling keys() or values() followed by distinct()
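
    The two options can be sketched the same way. This is again a plain-Python model of the semantics rather than real Spark code; the data mirrors the example in option A and the variable names are illustrative.

```python
from collections import defaultdict

# Same shape as the example above: an RDD of (key, value) tuples.
pairs = [("k1", "v11"), ("k1", "v12"), ("k2", "v21"), ("k2", "v22")]

# Option A, modelled on rdd.groupByKey(): one entry per key,
# with all of that key's values collected together.
buckets = defaultdict(list)
for key, value in pairs:
    buckets[key].append(value)
grouped = dict(buckets)

# Option B, modelled on rdd.keys().distinct() and rdd.values().distinct():
# throw away one side of the tuple, then deduplicate what is left.
distinct_keys = sorted({key for key, _ in pairs})
distinct_values = sorted({value for _, value in pairs})

print(grouped)
print(distinct_keys)
print(distinct_values)
```

    Note that option A keeps every value for a key, including repeated ones, while option B discards the pairing between keys and values entirely.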

    As of this writing (June 2015), UC Berkeley and edX are running a free online course, Introduction to Big Data and Apache Spark, which provides hands-on practice with these functions.
