I'm a newbie to Apache Spark and am learning the basic functionality. I have a small question. Suppose I have an RDD of tuples (key, value) and wanted to obtain some unique ones ou
distinct uses the hashCode and equals methods of the objects to make this determination. Tuples come with built-in equality that delegates to the equality and position of each element, so distinct works against the entire Tuple2 object. As Paul pointed out, you can call keys or values and then distinct. Or you can compute your own distinct values via aggregateByKey, which keeps the key pairing. Or, if you want the distinct keys, you could use a regular aggregate.
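
To make the options above concrete, here is a minimal sketch in Scala. The sample data, the object name DistinctExamples, the app name, and the local[*] master are all assumptions for illustration, not something from the question. It is one possible way to write each variant, not the only one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctExamples {
  def main(args: Array[String]): Unit = {
    // Local master and app name are illustrative assumptions.
    val conf = new SparkConf().setMaster("local[*]").setAppName("distinct-examples")
    val sc = new SparkContext(conf)

    // Hypothetical sample data: the pair ("a", 1) appears twice.
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 1)))

    // distinct compares whole tuples, so only the repeated ("a", 1) is dropped.
    val distinctPairs = pairs.distinct().collect().toSet
    // contains ("a", 1), ("a", 2), ("b", 1)

    // Project to keys or values first, then deduplicate.
    val distinctKeys = pairs.keys.distinct().collect().toSet     // contains "a", "b"
    val distinctValues = pairs.values.distinct().collect().toSet // contains 1, 2

    // aggregateByKey keeps the key pairing: one set of distinct values per key.
    val valuesPerKey = pairs
      .aggregateByKey(Set.empty[Int])(
        (seen, value) => seen + value,  // fold a value into the per-partition set
        (left, right) => left ++ right) // merge sets from different partitions
      .collect()
      .toMap
    // "a" maps to Set(1, 2), "b" maps to Set(1)

    // A plain aggregate collects the distinct keys into a single set on the driver.
    val keysViaAggregate = pairs.aggregate(Set.empty[String])(
      (seen, pair) => seen + pair._1,
      (left, right) => left ++ right)
    // contains "a", "b"

    println(distinctPairs)
    println(distinctKeys)
    println(distinctValues)
    println(valuesPerKey)
    println(keysViaAggregate)

    sc.stop()
  }
}
```

Note that the aggregate variant builds the whole key set on the driver, so it is probably only reasonable when the number of distinct keys is small. For large key sets, keys followed by distinct keeps the result as an RDD.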