How to transform Scala nested map operation to Scala Spark operation?

情话喂你 2021-01-22 11:18

The code below calculates the Euclidean distance between two Lists in a dataset:

 val user1 = List("a", "1", "3", "2", "6", "9")  //> user1: List[String]
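For context, the nested-map computation in plain Scala might look like the following sketch; the second user list and the label-dropping step are assumptions, since the original snippet is truncated:

```scala
// Assumed shape: the first element is a label, the rest are numeric strings
val user1 = List("a", "1", "3", "2", "6", "9")
val user2 = List("b", "4", "2", "8", "7", "9")   // hypothetical second user

// Drop the label and parse the remaining entries as doubles
def toVector(u: List[String]): List[Double] = u.tail.map(_.toDouble)

// Nested-map style Euclidean distance over two equal-length vectors
def euclidDistance(x: List[Double], y: List[Double]): Double =
  math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

val d = euclidDistance(toVector(user1), toVector(user2))
// d == sqrt((1-4)^2 + (3-2)^2 + (2-8)^2 + (6-7)^2 + (9-9)^2) = sqrt(47)
```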


        
2 Answers
  •  没有蜡笔的小新
    2021-01-22 11:55

    The actual solution will depend on the dimensions of the dataset. Assuming the original dataset fits in memory and you want to parallelize the computation of the Euclidean distance, I'd proceed like this:

    Assume `users` is the list of user ids and `data` maps each user id to the data to be processed for that user.

    // sc is the Spark Context
    type UserId = String
    type UserData = Array[Double]

    val users: List[UserId] = ???
    val data: Map[UserId, UserData] = ???

    // combinations generates the unique pairs of users for which distance makes sense:
    // given that euclidDistance(a, b) == euclidDistance(b, a), only (a, b) is in this set
    def combinations[T](l: List[T]): List[(T, T)] = l match {
      case Nil      => Nil
      case _ :: Nil => Nil
      case h :: t   => t.map(x => (h, x)) ++ combinations(t)
    }

    // broadcast the data once to all workers
    val broadcastData = sc.broadcast(data)
    val usersRdd = sc.parallelize(combinations(users))

    val euclidDistance: (UserData, UserData) => Double = (x, y) =>
      math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

    val userDistanceRdd = usersRdd.map { case (user1, user2) =>
      val userData = broadcastData.value   // local copy of the broadcast on each worker
      ((user1, user2), euclidDistance(userData(user1), userData(user2)))
    }
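The `combinations` helper can be checked on its own without Spark; on three users it produces the three unordered pairs:

```scala
def combinations[T](l: List[T]): List[(T, T)] = l match {
  case Nil      => Nil
  case _ :: Nil => Nil
  case h :: t   => t.map(x => (h, x)) ++ combinations(t)
}

val pairs = combinations(List("u1", "u2", "u3"))
// pairs == List(("u1","u2"), ("u1","u3"), ("u2","u3")); n*(n-1)/2 pairs for n users
```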
    

    In case the user data is too large for a broadcast variable, you would instead load it from external storage.
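As a sketch of that alternative, the broadcast lookup is replaced by lookups against externally stored data; in Spark this would be a keyed `RDD[(UserId, UserData)]` combined via `join`, but the plain-collection version below mirrors the shape (the `stored` map and its contents are illustrative assumptions):

```scala
type UserId = String
type UserData = Array[Double]

def euclidDistance(x: UserData, y: UserData): Double =
  math.sqrt((x zip y).map { case (a, b) => math.pow(a - b, 2) }.sum)

// Stand-in for data loaded from external storage; in Spark this would be a
// keyed RDD and the lookups below would become joins on the user-id key.
val stored: Map[UserId, UserData] = Map(
  "u1" -> Array(1.0, 3.0),
  "u2" -> Array(4.0, 7.0)
)

val pairDistances = List(("u1", "u2")).map { case (a, b) =>
  ((a, b), euclidDistance(stored(a), stored(b)))
}
// pairDistances == List((("u1","u2"), 5.0)), since sqrt(9 + 16) = 5
```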
