I'm a newbie to Apache Spark and am learning its basic functionality. I had a small doubt. Suppose I have an RDD of tuples (key, value) and wanted to obtain some unique ones ou
The API docs for RDD.distinct() only provide a one-sentence description:
"Return a new RDD containing the distinct elements in this RDD."
From recent experience I can tell you that in an RDD of tuples, distinct() considers the tuple as a whole: two elements count as duplicates only if both the key and the value match.
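To make the whole-tuple behaviour concrete, here is a small plain-Python sketch that models it locally, so it runs without a Spark installation. The sample data and the helper name `distinct_elements` are made up for illustration; in PySpark the corresponding call would be `rdd.distinct()`.

```python
# Plain-Python model of RDD.distinct() on (key, value) tuples.
# Assumption: the sample data and helper name are illustrative only.

def distinct_elements(pairs):
    """Drop duplicates, comparing each (key, value) tuple as a whole."""
    # Sorting only makes the result deterministic for this example;
    # Spark itself does not guarantee any ordering after distinct().
    return sorted(set(pairs))

pairs = [("k1", 1), ("k1", 1), ("k1", 2), ("k2", 1)]
unique_pairs = distinct_elements(pairs)
print(unique_pairs)
```

Only the exact repeat `("k1", 1)` is dropped; `("k1", 2)` stays because its value differs, even though the key is the same.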
If you want distinct keys or distinct values, then depending on exactly what you want to accomplish, you can either:
A. call groupByKey() to transform {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} to {(k1,[v11,v12]), (k2,[v21,v22])} ; or
B. strip out either the keys or values by calling keys() or values() followed by distinct()
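Both options can be sketched in plain Python to show what each one returns. This is only a local model under assumed sample data, not Spark code; the helper names `group_by_key`, `distinct_keys` and `distinct_values` are hypothetical stand-ins for the RDD calls named in the comments.

```python
from collections import defaultdict

# Assumption: illustrative sample data, with one repeated pair.
pairs = [("k1", "v11"), ("k1", "v12"), ("k2", "v21"), ("k2", "v22"), ("k2", "v21")]

def group_by_key(pairs):
    """Option A: model of groupByKey(), collecting every value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)  # repeated values are kept, not deduplicated
    return dict(grouped)

def distinct_keys(pairs):
    """Option B: model of keys() followed by distinct()."""
    return sorted({key for key, _ in pairs})

def distinct_values(pairs):
    """Option B: model of values() followed by distinct()."""
    return sorted({value for _, value in pairs})

grouped = group_by_key(pairs)
keys = distinct_keys(pairs)
values = distinct_values(pairs)
print(grouped)
print(keys)
print(values)
```

In PySpark the equivalent calls would be `rdd.groupByKey().mapValues(list)`, `rdd.keys().distinct()` and `rdd.values().distinct()`. Note that grouping leaves each key appearing once but keeps repeated values in its list, so a further deduplication step is needed if the values must also be unique.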
As of this writing (June 2015), UC Berkeley + EdX is running a free online course, Introduction to Big Data and Apache Spark, which provides hands-on practice with these functions.