I am using the cartesian transformation in Spark with Scala. Suppose my input consists of 4 elements (they could be numbers, characters, or tuples), say:
var myRDD=sc.parallelize(Array("e1","e2","e3","e4"))
myRDD.cartesian(myRDD) yields all possible pairs, but not necessarily in order. What is a smart way to get those pairs in order? That is:
Array((e1,e1), (e1,e2), (e1,e3), (e1,e4), (e2,e1), (e2,e2), (e2,e3), (e2,e4), (e3,e1), (e3,e2), (e3,e3), (e3,e4), (e4,e1), (e4,e2), (e4,e3), (e4,e4))
If what you need is to be able to identify each point (so you can determine each pair of points and their L2 distance), then what you really require is to add an id to each entry in the RDD or DataFrame.
If you want to use an RDD, the approach I recommend is:
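# PySpark: each entry is (id, (x, y)).
myRDD = sc.parallelize([(0, (0.0, 0.0)), (1, (2.0, 0.0)),
                        (2, (-3.0, 2.0)), (3, (-6.0, -4.0))])
# All ordered pairs of entries; coalesce reduces the result to at most 32 partitions.
combinations = myRDD.cartesian(myRDD).coalesce(32)
# Keep each unordered pair once (id1 < id2), then compute the L2 distance.
# Tuple-unpacking lambdas only work in Python 2, so the pairs are indexed instead.
distances = combinations\
    .filter(lambda p: p[0][0] < p[1][0])\
    .map(lambda p: (p[0][0], p[1][0],
                    ((p[0][1][0] - p[1][1][0]) ** 2 + (p[0][1][1] - p[1][1][1]) ** 2) ** 0.5))
distances.collect()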
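To sanity-check the filter-and-map logic on the four sample points without a Spark cluster, the same computation can be sketched in plain Python. This is only an illustration: `points` and `pairwise_distances` are hypothetical names that do not appear in the answer, and `itertools.product` stands in for `cartesian`.

```python
import itertools
import math

# Same sample data as above: (id, (x, y)).
points = [(0, (0.0, 0.0)), (1, (2.0, 0.0)),
          (2, (-3.0, 2.0)), (3, (-6.0, -4.0))]

def pairwise_distances(entries):
    """Return (id1, id2, L2 distance) for every pair with id1 < id2."""
    result = []
    for (id1, (x1, y1)), (id2, (x2, y2)) in itertools.product(entries, repeat=2):
        if id1 < id2:  # keep each unordered pair once and skip self-pairs
            result.append((id1, id2, math.hypot(x1 - x2, y1 - y2)))
    return result

for row in pairwise_distances(points):
    print(row)
```

Because every result carries both ids, the rows can be identified and sorted afterwards no matter what order the distributed computation returns them in.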
Have you tried the sorted function? It seems to sort tuples by their first member, then by the second, and so on:
scala> val a = Array((1, 1), (3, 3), (2, 2))
a: Array[(Int, Int)] = Array((1,1), (3,3), (2,2))
scala> a.sorted
res1: Array[(Int, Int)] = Array((1,1), (2,2), (3,3))
scala> val a = Array((1, 2), (3, 1), (2, 3))
a: Array[(Int, Int)] = Array((1,2), (3,1), (2,3))
scala> a.sorted
res2: Array[(Int, Int)] = Array((1,2), (2,3), (3,1))
scala> val a = Array((1, 2), (3, 1), (1, 1))
a: Array[(Int, Int)] = Array((1,2), (3,1), (1,1))
scala> a.sorted
res3: Array[(Int, Int)] = Array((1,1), (1,2), (3,1))
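The same lexicographic ordering can be checked against the pairs from the question without a cluster. The sketch below is plain Python (chosen only because the first answer is in Python); `elements`, `pairs`, and `ordered_pairs` are illustrative names, `itertools.product` stands in for `cartesian`, and the shuffle merely simulates the arbitrary order a distributed result may come back in.

```python
import itertools
import random

# Hypothetical local stand-in for myRDD; no Spark required.
elements = ["e1", "e2", "e3", "e4"]

# itertools.product plays the role of cartesian here.
pairs = list(itertools.product(elements, repeat=2))

# Simulate an unordered result with a fixed seed for reproducibility.
random.Random(0).shuffle(pairs)

# Tuples compare lexicographically: first member, then second.
ordered_pairs = sorted(pairs)
print(ordered_pairs)
```

On an actual RDD, the analogous step would presumably be a sort applied after cartesian (for example sortBy with the pair itself as the key), which costs a shuffle. Note that strings compare character by character, so a label such as "e10" would sort before "e2"; numeric ids avoid that.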
Source: https://stackoverflow.com/questions/33660998/explicit-sort-in-cartesian-transformation-in-scala-spark