I would like to append a new column to the dataframe "df" using the function get_distance:
def get_distance(x, y):
You cannot apply a plain Python function to Column objects directly, unless the function is written to operate on Column objects / expressions. You need a udf for that:
from pyspark.sql.functions import udf

@udf
def get_distance(x, y):
    ...
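But you cannot use SQLContext inside a udf (or inside a mapper in general), because the context exists only on the driver and not on the executors that run the function. Just join: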
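from pyspark.sql.functions import first

# Reduce "tab" to one column3 value per (column1, column2) pair, then join on those keys.
tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3"))
df.join(tab, ["column1", "column2"])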