PySpark: how to return the average of a column based on the value of another column?

孤者浪人 submitted on 2020-02-16 10:39:08

Question


I wouldn't expect this to be difficult, but I'm having trouble understanding how to take the average of a column in my Spark DataFrame.

The dataframe looks like:

+-------+------------+--------+------------------+
|Private|Applications|Accepted|              Rate|
+-------+------------+--------+------------------+
|    Yes|         417|     349|0.8369304556354916|
|    Yes|        1899|    1720|0.9057398630858347|
|    Yes|        1732|    1425|0.8227482678983834|
|    Yes|         494|     313|0.6336032388663968|
|     No|        3540|    2001|0.5652542372881356|
|     No|        7313|    4664|0.6377683577191303|
|    Yes|         619|     516|0.8336025848142165|
|    Yes|         662|     513|0.7749244712990937|
|    Yes|         761|     725|0.9526938239159002|
|    Yes|        1690|    1366| 0.808284023668639|
|    Yes|        6075|    5349|0.8804938271604938|
|    Yes|         632|     494|0.7816455696202531|
|     No|        1208|     877|0.7259933774834437|
|    Yes|       20192|   13007|0.6441660063391442|
|    Yes|        1436|    1228|0.8551532033426184|
|    Yes|         392|     351|0.8954081632653061|
|    Yes|       12586|    3239|0.2573494358811378|
|    Yes|        1011|     604|0.5974282888229476|
|    Yes|         848|     587|0.6922169811320755|
|    Yes|        8728|    5201|0.5958982584784601|
+-------+------------+--------+------------------+

I want to return the average of the Rate column when Private is equal to "Yes". How can I do this?


Answer 1:


One way to do this is to filter on Private and aggregate with avg:

from pyspark.sql.functions import col, avg
df_avg = df.filter(df["Private"] == "Yes").agg(avg(col("Rate")))
df_avg.show()



Answer 2:


This would work in Scala; the PySpark code should be very similar.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Sample data
val df = List(
  ("yes", 10),
  ("yes", 30),
  ("No", 40)
).toDF("private", "rate")

// Window over each distinct value of "private"
val window = Window.partitionBy($"private")

df.withColumn("avg",
    when($"private" === "No", null)
      .otherwise(avg($"rate").over(window)))
  .show()

Input DF

+-------+----+
|private|rate|
+-------+----+
|    yes|  10|
|    yes|  30|
|     No|  40|
+-------+----+

output df

+-------+----+----+
|private|rate| avg|
+-------+----+----+
|     No|  40|null|
|    yes|  10|20.0|
|    yes|  30|20.0|
+-------+----+----+



Answer 3:


Try

df.filter(df['Private'] == 'Yes').agg({'Rate': 'avg'}).collect()[0]



Answer 4:


Try:

from pyspark.sql.functions import col, mean, lit

df.where(col("Private")==lit("Yes")).select(mean(col("Rate"))).collect()
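For reference, the computation all of these snippets perform (filter to Private == "Yes", then average Rate) can be sanity-checked in plain Python. This is a minimal sketch, not Spark code; the sample rows below are a hypothetical subset of the table above:

```python
# Plain-Python sanity check of the filtered average (not Spark code).
# Rows are (Private, Rate) pairs, a subset of the sample dataframe above.
rows = [
    ("Yes", 0.8369304556354916),
    ("Yes", 0.9057398630858347),
    ("No", 0.5652542372881356),
    ("No", 0.6377683577191303),
    ("Yes", 0.8336025848142165),
]

# Keep only rows where Private == "Yes", then average the Rate values.
yes_rates = [rate for private, rate in rows if private == "Yes"]
avg_rate = sum(yes_rates) / len(yes_rates)
print(avg_rate)
```

This is exactly what the filter/where-plus-avg combinations above do in a distributed fashion.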


Source: https://stackoverflow.com/questions/60139613/pyspark-how-to-return-the-average-of-a-column-based-on-the-value-of-another-colu
