Problems with Spark dealing with a list of Python objects

Submitted by 那年仲夏 on 2020-01-07 07:10:14

Question


I am learning Spark, and I just ran into a problem when using Spark to deal with a list of Python objects. The following is my code:

import numpy as np    
from pyspark import SparkConf, SparkContext

### Definition of Class A
class A:
    def __init__(self, n):
        self.num = n

### Function "display"
def display(s):
    print s.num
    return s

def main():
    ### Initialize the Spark
    conf = SparkConf().setAppName("ruofan").setMaster("local")
    sc = SparkContext(conf = conf)

    ### Create a list of instances of Class A
    data = []
    for i in np.arange(5):
        x = A(i)
        data.append(x)

    ### Use Spark to parallelize the list of instances
    lines = sc.parallelize(data)

    ### Spark mapping
    lineLengths1 = lines.map(display)

if __name__ == "__main__":
    main()

When I run my code, it does not seem to print the number of each instance (it should have printed 0, 1, 2, 3, 4). I tried to find the reason, but I have no idea what is wrong. I would really appreciate it if anyone could help me.


Answer 1:


First of all, display is never executed. RDDs are lazily evaluated, so as long as you don't perform an action (like collect, count, or saveAsTextFile) nothing really happens.
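
For example, adding an action to the question's code forces the map (and hence display) to run. A minimal sketch, reusing the names from the question:

### map is lazy: nothing has run yet
lineLengths1 = lines.map(display)

### count is an action, so it forces display to execute for every element
print lineLengths1.count()   # prints 5; display's output goes to the worker's stdout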

Another part of the problem requires an understanding of Spark's architecture. Simplifying things a little bit, the driver program is responsible for creating the SparkContext and sending tasks to the worker nodes. Everything that happens during transformations (in your case map) is executed on the workers, so the output of the print statement goes to the workers' stdout. If you want to obtain some kind of output, you should consider using logs instead.
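
If you want the values printed on the driver itself, collect them back first. A minimal sketch, assuming the dataset is small enough to fit in the driver's memory:

### Pull the numbers back to the driver with collect, then print locally
for n in lines.map(lambda a: a.num).collect():
    print n   # runs on the driver, so the output appears in your console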

Finally, if your goal is to achieve some kind of side effect, it would be idiomatic to use foreach instead of map.
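
A minimal sketch with foreach; it is an action, so it executes immediately (the value display returns is simply ignored):

### foreach is an action executed purely for its side effects
### with setMaster("local") the worker output appears in the same console
lines.foreach(display)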



Source: https://stackoverflow.com/questions/31255950/problems-on-spark-dealing-with-list-of-python-object
