Pyspark - How to get basic stats (mean, min, max) along with quantiles (25%, 50%) for numerical cols in a single dataframe

Submitted by 此生再无相见时 on 2021-02-17 05:37:26

Question


I have a Spark DataFrame:

spark_df = spark.createDataFrame(
    [(1, 7, 'foo'), 
     (2, 6, 'bar'),
     (3, 4, 'foo'),
     (4, 8, 'bar'),
     (5, 1, 'bar')
    ],
    ['v1', 'v2', 'id'] 
)

Expected Output

       id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
    0  bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
    1  foo  2.000000  5.5      1        4        some-value  some-value  some-value  some-value

Until now, I can compute the basic stats like avg, min, and max, but not the quantiles. I know this is easy to do in Pandas, but I am not able to get it done in PySpark.

I also know about approxQuantile, but I am not able to combine the basic stats with the quantiles in PySpark.
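
For example, approxQuantile is a DataFrame method that returns a plain Python list rather than a column, so it does not compose with agg:

# returns a list of floats, not a Column, so it can't go inside groupby().agg()
quantiles = spark_df.approxQuantile('v1', [0.25, 0.5], 0.01)  # e.g. [2.0, 3.0]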

So far I can get the basic stats like mean and min by using agg; I want the quantiles in the same DataFrame:

from pyspark.sql import functions as F

func = [F.mean, F.min]
NUMERICAL_FEATURE_LIST = ['v1', 'v2']
GROUP_BY_FIELDS = ['id']
# one aggregate expression per (function, column) pair
exp = [f(F.col(c)) for f in func for c in NUMERICAL_FEATURE_LIST]
df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*exp)


Answer 1:


Perhaps this is helpful (Scala example; the summary API is identical in PySpark):

    val spark_df = Seq(
      (1, 7, "foo"),
      (2, 6, "bar"),
      (3, 4, "foo"),
      (4, 8, "bar"),
      (5, 1, "bar")
    ).toDF("v1", "v2", "id") // requires import spark.implicits._ outside spark-shell
    spark_df.show(false)
    spark_df.printSchema()
    // summary() defaults: "count", "mean", "stddev", "min", "25%", "50%", "75%", "max"
    spark_df.summary().show(false)

    /**
      * +---+---+---+
      * |v1 |v2 |id |
      * +---+---+---+
      * |1  |7  |foo|
      * |2  |6  |bar|
      * |3  |4  |foo|
      * |4  |8  |bar|
      * |5  |1  |bar|
      * +---+---+---+
      *
      * root
      * |-- v1: integer (nullable = false)
      * |-- v2: integer (nullable = false)
      * |-- id: string (nullable = true)
      *
      * +-------+------------------+------------------+----+
      * |summary|v1                |v2                |id  |
      * +-------+------------------+------------------+----+
      * |count  |5                 |5                 |5   |
      * |mean   |3.0               |5.2               |null|
      * |stddev |1.5811388300841898|2.7748873851023217|null|
      * |min    |1                 |1                 |bar |
      * |25%    |2                 |4                 |null|
      * |50%    |3                 |6                 |null|
      * |75%    |4                 |7                 |null|
      * |max    |5                 |8                 |foo |
      * +-------+------------------+------------------+----+
      */
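
For reference, the same call in PySpark, restricted to the statistics the question asks for (summary accepts statistic names as strings):

spark_df.summary("mean", "min", "25%", "50%").show()

Note that summary computes its statistics over the whole DataFrame; it does not group by id.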

If you need the output grouped by id in the format shown in the question, use the answer below.




Answer 2:


I think a syntax like this is what you're looking for:

spark_df.createOrReplaceTempView("spark_table")
spark.sql("SELECT id, AVG(v1) AS avg_v1, AVG(v2) AS avg_v2, \
 MIN(v1) AS min_v1, MIN(v2) AS min_v2, \
 percentile_approx(v1, 0.25) AS p25_v1, percentile_approx(v2, 0.25) AS p25_v2, \
 percentile_approx(v1, 0.5) AS p50_v1, percentile_approx(v2, 0.5) AS p50_v2 \
 FROM spark_table GROUP BY id").show(5)

It helps to create aliases because unformatted column names are a pain to work with.
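
The same aggregation can be written with the DataFrame API; a sketch using F.expr to reach the same percentile_approx SQL function (the native F.percentile_approx helper only exists from Spark 3.1 on):

from pyspark.sql import functions as F

df_stats = spark_df.groupBy('id').agg(
    F.mean('v1').alias('avg_v1'), F.mean('v2').alias('avg_v2'),
    F.min('v1').alias('min_v1'), F.min('v2').alias('min_v2'),
    # percentile_approx is exposed as a SQL function, hence F.expr
    F.expr('percentile_approx(v1, 0.25)').alias('p25_v1'),
    F.expr('percentile_approx(v2, 0.25)').alias('p25_v2'),
    F.expr('percentile_approx(v1, 0.5)').alias('p50_v1'),
    F.expr('percentile_approx(v2, 0.5)').alias('p50_v2'),
)
df_stats.show()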




Answer 3:


The describe method computes statistics such as count, mean, stddev, min, and max for the numeric columns in a DataFrame (it does not include quantiles; use summary for those):

spark_df.describe().show()



Source: https://stackoverflow.com/questions/62366103/pyspark-how-to-get-basic-stats-mean-min-max-along-with-quantiles-25-50
