Pandas-style transform of grouped data on PySpark DataFrame

悲&欢浪女 2021-02-07 10:24

If we have a Pandas data frame consisting of a column of categories and a column of values, we can remove the mean in each category by doing the following:

df["

3 answers
  •  半阙折子戏
    2021-02-07 10:50

    > I understand, each category requires a full scan of the DataFrame.

    No, it doesn't. DataFrame aggregations are performed using logic similar to aggregateByKey (see "DataFrame groupBy behaviour/optimization"). The slower part is the join, which requires sorting / shuffling, but it still doesn't require a scan per group.
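
    To make the "no scan per group" point concrete, here is a rough plain-Python sketch (not Spark code) of aggregateByKey-style logic. Each partition is folded once into per-key (sum, count) pairs, and the partial results are then merged. The function names and the sample partitions are made up for illustration; Spark's real implementation differs in the details.

```python
from collections import defaultdict


def aggregate_partition(rows):
    """Fold one partition into {key: (sum, count)} in a single pass."""
    acc = defaultdict(lambda: (0.0, 0))
    for key, value in rows:
        total, count = acc[key]
        acc[key] = (total + value, count + 1)
    return dict(acc)


def merge_partials(partials):
    """Combine the per-partition results key by key."""
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for key, (total, count) in partial.items():
            m_total, m_count = merged[key]
            merged[key] = (m_total + total, m_count + count)
    return dict(merged)


def means_by_key(partitions):
    """Mean per key; every row is visited exactly once overall."""
    merged = merge_partials(aggregate_partition(p) for p in partitions)
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical (Category, Values) rows split across two partitions.
partitions = [
    [("a", 1.0), ("b", 10.0), ("a", 3.0)],
    [("b", 20.0), ("a", 5.0)],
]
means = means_by_key(partitions)
```

    All categories are aggregated in the same pass, so the cost grows with the number of rows, not with rows times categories.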

    If this is the exact code you use, it is slow because you don't provide a join expression, so Spark simply performs a Cartesian product. That makes it not only inefficient but also incorrect. You want something like this:

    from pyspark.sql.functions import col

    # Compute the per-category means once, then join them back on an explicit key.
    means = df.groupBy("Category").mean("Values").alias("means")
    df.alias("df").join(means, col("df.Category") == col("means.Category"))
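
    To see why the missing join expression gives wrong results and not just slow ones, here is a small plain-Python analogy (again not Spark code, with made-up sample data). Without a key, every row is paired with every category mean; with a key, each row gets exactly one mean.

```python
from itertools import product

# Hypothetical (Category, Values) rows and their per-category means.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
means = {"a": 2.0, "b": 10.0}

# No join condition: every row is paired with every mean (3 * 2 = 6 rows),
# including pairs where the categories do not match.
cartesian = [
    (cat, val, mean)
    for (cat, val), (_, mean) in product(rows, means.items())
]

# Join on Category: one output row per input row (3 rows).
keyed = [(cat, val, means[cat]) for cat, val in rows]

# Removing the mean only makes sense on the keyed result.
demeaned = [(cat, val - mean) for cat, val, mean in keyed]
```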
    

    > I think (but have not verified) that I can speed this up a great deal if I collect the result of the group-by/mean into a dictionary, and then use that dictionary in a UDF

    It is possible, although performance will vary on a case-by-case basis. The problem with Python UDFs is that data has to be moved to and from the Python interpreter. Still, it is definitely worth trying. You should consider using a broadcast variable for nameToMean, though.
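
    A minimal sketch of the UDF-plus-broadcast idea, assuming an existing SparkContext `sc` and a DataFrame `df` with `Category` and `Values` columns. The helper names `demean` and `make_demean_udf` are hypothetical, and the Spark imports are deferred into the factory so that the lookup logic can be tried without a cluster. Treat it as a starting point, not a tuned implementation.

```python
def demean(category, value, name_to_mean):
    """Subtract the mean of the row's category from its value."""
    return float(value) - name_to_mean[category]


def make_demean_udf(sc, name_to_mean):
    """Wrap demean() in a Spark UDF that reads the means from a broadcast variable."""
    # Imported here so the pure helper above also works without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Ship the dictionary to each executor once instead of with every task.
    bc = sc.broadcast(name_to_mean)
    return udf(lambda category, value: demean(category, value, bc.value),
               DoubleType())


# Intended usage (assumes `sc` and `df` exist; not executed here):
#   name_to_mean = {row[0]: row[1]
#                   for row in df.groupBy("Category").mean("Values").collect()}
#   demean_udf = make_demean_udf(sc, name_to_mean)
#   df.withColumn("Demeaned", demean_udf(df["Category"], df["Values"]))
```

    This only works when the collected dictionary fits comfortably in memory on the driver and on each executor.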

    > Is there an idiomatic way to express this type of operation without sacrificing performance?

    In PySpark 1.6 you can use the broadcast function:

    from pyspark.sql.functions import broadcast, col

    df.alias("df").join(
        broadcast(means), col("df.Category") == col("means.Category"))
    

    but it is not available in versions <= 1.5.
