How to get other columns when using Spark DataFrame groupby?

2020-11-29 22:04

when I use DataFrame groupby like this:

df.groupBy(df(\"age\")).agg(Map(\"id\"->\"count\"))

I will only get a DataFrame with the columns "age" and "count(id)", but df has many other columns, like "name", that I also want to keep.

7 Answers
  • 2020-11-29 22:09

    Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most of the major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.

    Since results are not well defined for aggregations like count, and behavior tends to vary across systems that support this type of query, you can simply include the additional columns using an arbitrary aggregate such as first or last.

    In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive. All three options are sketched below.
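
    A minimal sketch of these options, assuming a df with id, name, and age columns like the one in the question (the column names are just illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count, first, row_number}

    // Option 1: join the aggregate back to the original table
    val counts = df.groupBy("age").agg(count("id").as("count"))
    val joined = df.join(counts, Seq("age"))

    // Option 2: carry extra columns through with an arbitrary aggregate
    val withFirst = df.groupBy("age").agg(count("id").as("count"), first("name").as("name"))

    // Option 3: window functions plus a filter, e.g. to keep one row per age group
    val w = Window.partitionBy("age")
    val onePerGroup = df
      .withColumn("count", count("id").over(w))
      .withColumn("rn", row_number().over(w.orderBy(col("id"))))
      .where(col("rn") === 1)
      .drop("rn")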

  • 2020-11-29 22:09

    Maybe this solution will be helpful.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # build a context so the snippet is self-contained
    sc = SparkContext(conf=SparkConf().setAppName("group-count"))
    sqlContext = SQLContext(sc)

    name_list = [(101, 'abc', 24), (102, 'cde', 24), (103, 'efg', 22), (104, 'ghi', 21),
                 (105, 'ijk', 20), (106, 'klm', 19), (107, 'mno', 18), (108, 'pqr', 18),
                 (109, 'rst', 26), (110, 'tuv', 27), (111, 'pqr', 18), (112, 'rst', 28), (113, 'tuv', 29)]

    name_age_df = sqlContext.createDataFrame(name_list, ['id', 'name', 'age'])

    # count ids per age with a window function, so every original column is kept
    age_w = Window.partitionBy("age")
    name_age_count_df = name_age_df.withColumn("count", F.count("id").over(age_w)).orderBy("count")
    name_age_count_df.show()

    Output:

    +---+----+---+-----+
    | id|name|age|count|
    +---+----+---+-----+
    |109| rst| 26|    1|
    |113| tuv| 29|    1|
    |110| tuv| 27|    1|
    |106| klm| 19|    1|
    |103| efg| 22|    1|
    |104| ghi| 21|    1|
    |105| ijk| 20|    1|
    |112| rst| 28|    1|
    |101| abc| 24|    2|
    |102| cde| 24|    2|
    |107| mno| 18|    3|
    |111| pqr| 18|    3|
    |108| pqr| 18|    3|
    +---+----+---+-----+
    
  • 2020-11-29 22:10

    One way to get all the columns back after a groupBy is to use the join function.

    feature_group = ['name', 'age']
    data_counts = df.groupBy(feature_group).count().alias("counts")
    data_joined = df.join(data_counts, feature_group)
    

    data_joined will now have all columns including the count values.

  • 2020-11-29 22:10

    Here is an example that I came across in a spark-workshop:

    val populationDF = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .format("csv").load("file:///databricks/driver/population.csv")
      .select('name, regexp_replace(col("population"), "\\s", "").cast("integer").as("population"))
    

    val maxPopulationDF = populationDF.agg(max('population).as("populationmax"))

    To get the other columns, I do a simple join between the original DF and the aggregated one:

    populationDF
      .join(maxPopulationDF, populationDF.col("population") === maxPopulationDF.col("populationmax"))
      .select('name, 'populationmax)
      .show()
    
  • You can do it like this:

    Sample data:

    name    age id
    abc     24  1001
    cde     24  1002
    efg     22  1003
    ghi     21  1004
    ijk     20  1005
    klm     19  1006
    mno     18  1007
    pqr     18  1008
    rst     26  1009
    tuv     27  1010
    pqr     18  1012
    rst     28  1013
    tuv     29  1011
    
    df.select("name","age","id").groupBy("name","age").count().show();
    

    Output:

        +----+---+-----+
        |name|age|count|
        +----+---+-----+
        | efg| 22|    1|
        | tuv| 29|    1|
        | rst| 28|    1|
        | klm| 19|    1|
        | pqr| 18|    2|
        | cde| 24|    1|
        | tuv| 27|    1|
        | ijk| 20|    1|
        | abc| 24|    1|
        | mno| 18|    1|
        | ghi| 21|    1|
        | rst| 26|    1|
        +----+---+-----+
    
  • 2020-11-29 22:31

    You need to remember that aggregate functions reduce the rows, so you need to specify which row's values you want via a reducing function. If you want to retain all rows of a group (warning: this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user-defined function) to reduce them by your criterion, money in this example, and then expand the columns from the single reduced row with another UDF. For the purpose of this answer I assume you wish to retain the name of the person who has the most money.

    import org.apache.spark.sql._
    import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.StringType

    import scala.collection.mutable

    object TestJob3 {

      def main(args: Array[String]): Unit = {

        val sparkSession = SparkSession
          .builder()
          .appName(this.getClass.getName.replace("$", ""))
          .master("local")
          .getOrCreate()

        import sparkSession.sqlContext.implicits._

        val rawDf = Seq(
          (1, "Moe", "Slap", 2.0, 18),
          (2, "Larry", "Spank", 3.0, 15),
          (3, "Curly", "Twist", 5.0, 15),
          (4, "Laurel", "Whimper", 3.0, 9),
          (5, "Hardy", "Laugh", 6.0, 18),
          (6, "Charley", "Ignore", 5.0, 5)
        ).toDF("id", "name", "requisite", "money", "age")

        rawDf.show(false)
        rawDf.printSchema

        val rawSchema = rawDf.schema

        // untyped UDFs: one reduces a collected group to a single row,
        // the other extracts the name from that row
        val fUdf = udf(reduceByMoney, rawSchema)
        val nameUdf = udf(extractName, StringType)

        val aggDf = rawDf
          .groupBy("age")
          .agg(
            count(struct("*")).as("count"),
            max(col("money")),
            collect_list(struct("*")).as("horizontal") // all rows of the group, as a list
          )
          .withColumn("short", fUdf($"horizontal"))    // reduce each group to the row with the most money
          .withColumn("name", nameUdf($"short"))       // pull the name back out of that row
          .drop("horizontal")

        aggDf.printSchema
        aggDf.show(false)
      }

      // keep whichever row has the larger "money" value
      def reduceByMoney = (x: Any) => {
        val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
        d.reduce((r1, r2) =>
          if (r1.getAs[Double]("money") >= r2.getAs[Double]("money")) r1 else r2
        )
      }

      def extractName = (x: Any) => {
        x.asInstanceOf[GenericRowWithSchema].getAs[String]("name")
      }
    }
    

    Here is the output:

    +---+-----+----------+----------------------------+-------+
    |age|count|max(money)|short                       |name   |
    +---+-----+----------+----------------------------+-------+
    |5  |1    |5.0       |[6, Charley, Ignore, 5.0, 5]|Charley|
    |15 |2    |5.0       |[3, Curly, Twist, 5.0, 15]  |Curly  |
    |9  |1    |3.0       |[4, Laurel, Whimper, 3.0, 9]|Laurel |
    |18 |2    |6.0       |[5, Hardy, Laugh, 6.0, 18]  |Hardy  |
    +---+-----+----------+----------------------------+-------+
    