Custom function over pyspark dataframe

Submitted by 孤人 on 2021-01-28 07:00:30

Question


I'm trying to apply a custom function to each row of a PySpark dataframe. The function takes the row plus two other vectors of the same dimension, and returns the sum of the third vector's values at the positions where the row's values match those of the second vector.

import pandas as pd
import numpy as np

Function:

def V_sum(row, b, c):
    # Sum the entries of c at the positions where row and b agree elementwise
    return float(np.sum(c[row == b]))
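
For instance, taking the fourth row of the example data below together with the vectors B and V defined there, the call evaluates like this (a quick sanity check, reusing values from further down):

row = np.array([1, 0, 1, 1])
b = np.array([1, 0, 1, 0])   # agrees with row at positions 0, 1 and 2
c = np.array([5, 1, 2, 4])
V_sum(row, b, c)             # 5 + 1 + 2 = 8.0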

What I want to achieve is simple with pandas:

pd_df = pd.DataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], columns=['t1', 't2', 't3', 't4'])
   t1  t2  t3  t4
0   0   1   0   0
1   1   1   0   0
2   0   0   1   0
3   1   0   1   1
4   1   1   0   0

B = np.array([1,0,1,0])
V = np.array([5,1,2,4])

pd_df.apply(lambda x: V_sum(x, B, V), axis=1)
0    4.0
1    9.0
2    7.0
3    8.0
4    9.0
dtype: float64

I would like to perform the same action in pyspark.

from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext("local")
sqlContext = SQLContext(sc)

spk_df = sqlContext.createDataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], ['t1', 't2', 't3', 't4'])
spk_df.show()
+---+---+---+---+
| t1| t2| t3| t4|
+---+---+---+---+
|  0|  1|  0|  0|
|  1|  1|  0|  0|
|  0|  0|  1|  0|
|  1|  0|  1|  1|
|  1|  1|  0|  0|
+---+---+---+---+

I thought about using a UDF, but I can't get it to work:

from pyspark.sql.types import FloatType
import pyspark.sql.functions as F

V_sum_udf = F.udf(V_sum, FloatType()) 
spk_df.select(V_sum_udf(F.array(*(F.col(x) for x in spk_df.columns))).alias("results")).show()

Clearly I'm doing something wrong because it yields:

Py4JJavaError: An error occurred while calling o27726.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 91, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Answer 1:


If you've got non-column data that you want to use inside a function together with column data to compute a new column, a UDF plus a closure plus withColumn is a good place to start. The attempt above fails because V_sum expects three arguments but the registered UDF receives only one (the assembled array); on top of that, Spark hands that array to Python as a plain list, so the elementwise comparison needs a numpy conversion. Closing over B and V in a lambda fixes both problems:

B = np.array([2, 0, 1, 0])
V = np.array([5, 1, 2, 4])

# The lambda closes over B and V, so the UDF itself only receives the row;
# np.array(row) converts the incoming Python list so that the elementwise
# comparison inside V_sum behaves as intended.
v_sum_udf = F.udf(lambda row: V_sum(np.array(row), B, V), FloatType())
spk_df.withColumn("results", v_sum_udf(F.array(*(F.col(x) for x in spk_df.columns)))).show()
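
If I've done the arithmetic right, the show() call above should print something like:

+---+---+---+---+-------+
| t1| t2| t3| t4|results|
+---+---+---+---+-------+
|  0|  1|  0|  0|    4.0|
|  1|  1|  0|  0|    4.0|
|  0|  0|  1|  0|    7.0|
|  1|  0|  1|  1|    3.0|
|  1|  1|  0|  0|    4.0|
+---+---+---+---+-------+

(The sums differ from the pandas example above because this answer uses B = [2, 0, 1, 0] rather than the question's [1, 0, 1, 0].)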

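As a side note: since B and V are plain constants, the same column can be built entirely from Spark's built-in expressions, with no Python UDF and hence no serialization overhead. A minimal sketch of that alternative (my own variation, not from the original answer):

import pyspark.sql.functions as F

# One when/otherwise term per column: contribute V[i] where the column equals B[i]
result = sum(
    F.when(F.col(col) == int(b), float(v)).otherwise(0.0)
    for col, b, v in zip(spk_df.columns, [2, 0, 1, 0], [5, 1, 2, 4])
)
spk_df.withColumn("results", result).show()

Python's built-in sum works here because adding PySpark Column expressions yields another Column expression.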

Source: https://stackoverflow.com/questions/47597711/custom-function-over-pyspark-dataframe
