How to reference a dataframe from inside a UDF on another dataframe?

天涯浪人 2021-01-06 02:36

How do you reference a PySpark dataframe from inside a UDF that is executing on another dataframe?

Here's a dummy example. I am creating two dataframes scores

2 answers
  •  温柔的废话
    2021-01-06 03:14

    Convert the paired lists into a dictionary so that last names can be looked up by student ID:

    data2 = {}
    for i in range(len(student_ids)):
        data2[student_ids[i]] = last_name[i]
    

    Instead of creating an RDD and converting it to a dataframe, create a broadcast variable:

    # rdd = sc.parallelize(data2)
    # lastnames = sqlCtx.createDataFrame(rdd, schema)
    lastnames = sc.broadcast(data2)  
    

    Now access it inside the UDF through the value attribute of the broadcast variable (lastnames):

    from pyspark.sql.functions import udf
    def getLastName(sid):
        return lastnames.value[sid]
    
