Spark rdd write in global list

Submitted by 百般思念 on 2019-12-02 13:03:27
vvladymyrov

The reason you see Li set to [] after executing mapValues is that Spark serializes the Fn function, along with every global variable it references (this bundle is called a closure), and sends it to another machine, the worker.

But there is no corresponding mechanism for sending the closure's state back from the worker to the driver, so any changes the worker makes to a global list are lost.
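A plain-Python sketch of the same effect, using only the standard multiprocessing module (no Spark needed): the worker process gets its own copy of the globals the function references, so appending to a global list there never changes the list in the parent process, just as appends on a Spark worker never reach the driver.

```python
import multiprocessing as mp

Li = []

def fn(value):
    Li.append(value)      # mutates the worker's own copy of Li
    return value * 2

# Run on a fork-based platform (e.g. Linux): each worker process gets
# its own copy of the interpreter state, so its appends never reach
# this process.
with mp.Pool(1) as pool:
    results = pool.map(fn, [2, 3, 4])

print(Li)       # still [] here: the worker-side appends are lost
print(results)  # returned values, by contrast, do come back
```

The returned values arrive intact because multiprocessing, like Spark, ships function *results* back explicitly; side effects on copied globals are simply discarded with the worker.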

To receive results, return them from your function and use an action such as take() or collect(). But be careful: don't send back more data than fits into the driver's memory, otherwise the Spark application will throw an out-of-memory exception.

Also, you have not executed an action on your mapValues transformation, so in your example no tasks were executed on the workers at all.
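A rough analogy in plain Python: a transformation only builds a plan, much like a generator expression, and nothing runs until something consumes it, just as no Spark task runs until an action like collect() is called.

```python
calls = []

def fn(value):
    calls.append(value)   # side effect so we can observe execution
    return value * 2

# Like mapValues: this builds a lazy pipeline, it does not run fn yet.
lazy = (fn(x) for x in [2, 3, 4])
print(calls)              # nothing has executed so far

# Like collect(): consuming the pipeline forces execution.
result = list(lazy)
print(calls)
print(result)
```

This is only an analogy: Spark additionally records lineage and distributes the work, but the "nothing happens until an action" behavior is the same.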

rdd = sc.parallelize([(x, x+1) for x in range(2, 5)])

def Fn(value):
    return value*2

Li = rdd.mapValues(lambda x: Fn(x)).collect()  # collect() is the action that triggers execution

print(Li)

would result in

[(2, 6), (3, 8), (4, 10)]

Edit

Following your problem description (based on my understanding of what you want to do):

L1 = range(20)
rdd = sc.parallelize(L1)

L2 = rdd.filter(lambda x: x % 2==0).collect()

print(L2)
>>> [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]