Spark __getnewargs__ error

执笔经年 2020-12-19 02:53

I am trying to clean a Spark DataFrame by mapping it to an RDD and then back to a DataFrame. Here's a toy example:

def replace_values(row,sub_rules):
    d = row.asDict()
    for col,old_val,new_val in sub_rules:
        if d[col] == old_val:
            d[col] = new_val
    return Row(**d)

ex = sc.parallelize([{'name': 'Alice', 'age': 1},{'name': 'Bob', 'age': 2}])
ex = sqlContext.createDataFrame(ex)
(ex.map(lambda row: replace_values(row,[(col,1,3) for col in ex.columns]))
    .toDF(schema=ex.schema))

Running the last statement fails with a __getnewargs__ error.
1 Answer
  • 2020-12-19 03:01

    The error is caused by the reference to ex.columns inside the .map(lambda...) statement. You can't reference an RDD or a DataFrame from within the function passed to an RDD transformation: the lambda's closure captures ex, so Spark has to pickle the whole DataFrame (including the Py4J JavaObject it wraps) to ship the closure to the executors, and that serialization fails with the __getnewargs__ error. Spark is supposed to issue a more helpful error in this case, but apparently that didn't make it into this version.
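
    To illustrate the difference, here is a minimal sketch of the same capture problem; the expressions below are only for illustration:

        # Fails: the lambda refers to `ex`, so the DataFrame itself has to be
        # pickled into the task closure, which triggers the __getnewargs__ error.
        ex.map(lambda row: len(ex.columns)).count()

        # Works: the closure captures only `col_names`, a plain list of strings.
        col_names = ex.columns
        ex.map(lambda row: len(col_names)).count()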

    The solution is to replace such references with copies of the referenced values, made on the driver before the transformation:

    import copy
    from pyspark.sql import Row

    def replace_values(row, sub_rules):
        # Rebuild the Row as a dict, apply each (column, old_value, new_value)
        # rule, and return a new Row with the substitutions applied.
        d = row.asDict()
        for col, old_val, new_val in sub_rules:
            if d[col] == old_val:
                d[col] = new_val
        return Row(**d)

    ex = sc.parallelize([{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 2}])
    ex = sqlContext.createDataFrame(ex)

    # Copy the column names into a plain Python list on the driver so the
    # lambda captures only this list, not the DataFrame itself.
    cols = copy.deepcopy(ex.columns)
    (ex.map(lambda row: replace_values(row, [(col, 1, 3) for col in cols]))
        .toDF(schema=ex.schema))
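
    The key point is that the lambda now closes over cols, a plain list of strings, rather than over ex; since ex.columns already returns a fresh Python list, the deepcopy is mainly a defensive copy.

    The snippet above uses the Spark 1.x API (sqlContext, DataFrame.map). On Spark 2.x, where DataFrame.map is no longer available in PySpark, a roughly equivalent sketch goes through the underlying RDD and reuses replace_values from above; it assumes a SparkSession named spark:

        import copy
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        ex = spark.createDataFrame([{'name': 'Alice', 'age': 1},
                                    {'name': 'Bob', 'age': 2}])

        # Plain Python list; safe for the closure to capture.
        cols = copy.deepcopy(ex.columns)

        cleaned = (ex.rdd
                     .map(lambda row: replace_values(row, [(c, 1, 3) for c in cols]))
                     .toDF(schema=ex.schema))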
    