Pandas-style transform of grouped data on PySpark DataFrame

Submitted by 回眸只為那壹抹淺笑 on 2019-12-21 03:57:05

Question


If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df["DemeanedValues"] = df.groupby("Category")["Values"].transform(lambda g: g - numpy.mean(g))

As far as I understand, Spark DataFrames do not directly offer this group-by/transform operation (I am using PySpark on Spark 1.5.0). So, what is the best way to implement this computation?

I have tried using a group-by/join as follows:

df2 = df.groupBy("Category").mean("Values")
df3 = df2.join(df)

But it is very slow since, as I understand, each category requires a full scan of the DataFrame.

I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF as follows:

nameToMean = {...}
f = lambda category, value: value - nameToMean[category]
categoryDemeaned = pyspark.sql.functions.udf(f, pyspark.sql.types.DoubleType())
df = df.withColumn("DemeanedValues", categoryDemeaned(df.Category, df.Values))

Is there an idiomatic way to express this type of operation without sacrificing performance?


Answer 1:


As I understand, each category requires a full scan of the DataFrame.

No, it doesn't. DataFrame aggregations are performed using logic similar to aggregateByKey (see DataFrame groupBy behaviour/optimization). The slower part is the join, which requires sorting/shuffling, but even that doesn't require a scan per group.
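A quick way to see this (a sketch; it reuses the column names from the question and assumes df is already defined) is to look at the physical plan of the aggregation:

# Expect a partial (map-side) aggregate, a shuffle, and a final aggregate --
# i.e. aggregateByKey-style combining rather than one scan per category.
df.groupBy("Category").mean("Values").explain()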

If this is the exact code you use, it is slow because you don't provide a join expression. Because of that, it simply performs a Cartesian product, so it is not only inefficient but also incorrect. You want something like this:

from pyspark.sql.functions import col

means = df.groupBy("Category").mean("Values").alias("means")
df.alias("df").join(means, col("df.Category") == col("means.Category"))

I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF

It is possible, although performance will vary on a case-by-case basis. A problem with using Python UDFs is that data has to be moved to and from Python. Still, it is definitely worth trying. You should consider using a broadcast variable for nameToMean, though.
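
A minimal sketch of that approach (it assumes a SparkContext available as sc and the nameToMean dictionary already collected):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Broadcast the small category -> mean dictionary so each executor gets one copy
# instead of shipping it with every task's closure.
bc_means = sc.broadcast(nameToMean)

def demean(category, value):
    return value - bc_means.value[category]

demean_udf = udf(demean, DoubleType())
df = df.withColumn("DemeanedValues", demean_udf(df["Category"], df["Values"]))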

Is there an idiomatic way to express this type of operation without sacrificing performance?

In PySpark 1.6 you can use the broadcast function:

df.alias("df").join(
    broadcast(means), col("df.Category") == col("means.Category"))

but it is not available in 1.5 or earlier.




Answer 2:


Actually, there is an idiomatic way to do this in Spark, using the Hive OVER expression.

i.e.

df.registerTempTable('df')
with_category_means = sqlContext.sql('select *, mean(Values) OVER (PARTITION BY Category) as category_mean from df')
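
Building on the query above, the demeaned column can be computed directly in the same statement (a sketch using the column names from the question):

demeaned = sqlContext.sql('select *, Values - mean(Values) OVER (PARTITION BY Category) as DemeanedValues from df')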

Under the hood, this uses a window function. I'm not sure whether it is faster than your solution, though.




Answer 3:


You can use Window to do this

i.e.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

window_var = Window.partitionBy('Category')
df = df.withColumn('DemeanedValues', F.col('Values') - F.mean('Values').over(window_var))


Source: https://stackoverflow.com/questions/34464577/pandas-style-transform-of-grouped-data-on-pyspark-dataframe
