Checking if an RDD element is in another using the map function

Submitted by 耗尽温柔 on 2019-12-03 23:09:40

Your implementation tries to use one RDD (ids) inside a closure that maps over another RDD - this isn't allowed in Spark: RDD operations cannot be nested, and anything referenced in a closure must be serializable (and preferably small), since it is serialized and shipped to every executor.
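To illustrate the "serializable and small" rule: if ids were known to be small, one common alternative (a sketch, not the join-based answer below) is to collect it once on the driver into a plain Set and broadcast that set, so that only local, serializable data is captured by the closure:

    // Sketch - assumes a live SparkContext `sc`; shown only to illustrate
    // the closure rule, not as the recommended solution.
    val ids  = sc.parallelize(List(1, 2, 10, 5))
    val vals = sc.parallelize(List((1, 0), (2, 0), (3, 0), (4, 0)))

    // Collect the (small) ids on the driver and broadcast them as a Set;
    // a plain Set is serializable, so the closure below may capture it.
    val idSet = sc.broadcast(ids.collect().toSet)

    val result = vals.map { case (k, v) =>
      (k, if (idSet.value.contains(k)) v + 1 else v)
    }

This only works when ids comfortably fits in driver memory; for large id sets, the join approach below is the right tool.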

A leftOuterJoin between these RDDs should get you what you want:

val ids  = sc.parallelize(List(1, 2, 10, 5))
val vals = sc.parallelize(List((1, 0), (2, 0), (3, 0), (4, 0)))
val result = vals
  .leftOuterJoin(ids.keyBy(i => i))
  .mapValues {
    case (v, Some(matchingId)) => v + 1 // increment value if a match was found
    case (v, None)             => v     // leave value as-is otherwise
  }

The leftOuterJoin expects two key-value RDDs, so we artificially key the ids RDD with the identity function. We then map the values of each resulting (id: Int, (value: Int, matchingId: Option[Int])) record to either v + 1 or v, depending on whether a match was found.
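Concretely, with the sample inputs above, the pipeline produces the following (a sketch of the intermediate stages; element order in the collected output is not guaranteed):

    // ids keyed by identity:  (1,1), (2,2), (10,10), (5,5)
    // after leftOuterJoin:    (1,(0,Some(1))), (2,(0,Some(2))),
    //                         (3,(0,None)),    (4,(0,None))
    // after mapValues:        (1,1), (2,1), (3,0), (4,0)
    result.collect() // Array((1,1), (2,1), (3,0), (4,0)), in some order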

Generally, you should always aim to minimize the use of actions like collect when using Spark, as such actions move data back from the distributed cluster into your driver application.
