How do you reference a PySpark DataFrame during the execution of a UDF on another DataFrame?
Here's a dummy example. I am creating two dataframes: scores
# Change the pairs to a dictionary for easy lookup of last names
data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]
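For reference, the same dictionary can be built in one line with dict(zip(...)); a small plain-Python sketch (the sample ids and names here are made up):

```python
# Hypothetical sample data standing in for the question's lists
student_ids = [101, 102, 103]
last_name = ["Smith", "Jones", "Lee"]

# One-line equivalent of the loop above
data2 = dict(zip(student_ids, last_name))
print(data2)  # {101: 'Smith', 102: 'Jones', 103: 'Lee'}
```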
Instead of creating an RDD and turning it into a DataFrame, create a broadcast variable:

# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)
Now access it inside the UDF through the value attribute of the broadcast variable (lastnames):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def getLastName(sid):
    return lastnames.value[sid]

getLastName_udf = udf(getLastName, StringType())
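One caveat worth sketching: lastnames.value[sid] raises KeyError when a student id is missing from the dictionary, which fails the whole Spark task. A plain-Python sketch (no Spark needed; data2 stands in for lastnames.value) of a safer lookup with .get:

```python
# data2 stands in for lastnames.value inside the UDF
data2 = {101: "Smith", 102: "Jones"}

def get_last_name(sid):
    # .get returns None for unknown ids instead of raising KeyError;
    # a UDF returning None produces a null in the result column
    return data2.get(sid)

print(get_last_name(101))  # Smith
print(get_last_name(999))  # None
```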