Difference between two rows in Spark dataframe

我在风中等你 2020-12-03 06:38

I created a DataFrame in Spark by grouping on column1 and date and summing the amount:

val table = df1.groupBy($"column1", $"date").sum("amount")
         

Now, within each group, I need the difference between the amounts of two rows (two different dates).
3 Answers
  •  [愿得一人]
    2020-12-03 07:07

    Assuming those two dates belong to each group of your table.

    My imports:

    import org.apache.spark.sql.functions.{concat_ws, collect_list, lit, udf}
    

    Prepare the DataFrame:

    scala> val seqRow = Seq(
     | ("A","1- jul",1000),
     | ("A","1-june",2000),
     | ("A","1-May",2000),
     | ("A","1-dec",3000),
     | ("B","1-jul",100),
     | ("B","1-june",300),
     | ("B","1-May",400),
     | ("B","1-dec",300))
    
    seqRow: Seq[(String, String, Int)] = List((A,1- jul,1000), (A,1-june,2000), (A,1-May,2000), (A,1-dec,3000), (B,1-jul,100), (B,1-june,300), (B,1-May,400), (B,1-dec,300))
    
    scala> val input_df = sc.parallelize(seqRow).toDF("column1","date","amount")
    input_df: org.apache.spark.sql.DataFrame = [column1: string, date: string ... 1 more field]
    

    Now write a UDF for your case:

    scala> def calc_diff = udf((list: Seq[String], startMonth: String, endMonth: String) => {
         |   // parse each "date$amount" string into (month, amount)
         |   val monthMap = list.map { str =>
         |     val splitText = str.split("\\$")
         |     val month = splitText(0).split("-")(1).trim
         |     (month.toLowerCase, splitText(1).toInt)
         |   }.toMap
         |
         |   // look up the two requested months and subtract
         |   val stMnth = monthMap(startMonth)
         |   val endMnth = monthMap(endMonth)
         |   endMnth - stMnth
         | })
    calc_diff: org.apache.spark.sql.expressions.UserDefinedFunction
    

    Now, prepare the output.
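
    The snippets below use a group_df with a collect_val column that the transcript never shows being built. Judging from the imports (concat_ws, collect_list) and the UDF's split on "$", it was presumably created along these lines (a reconstruction, not copied from the original session):

    // Reconstructed step (assumption): collect each group's "date$amount"
    // strings alongside the group's total amount.
    import org.apache.spark.sql.functions.sum

    val group_df = input_df
      .groupBy("column1")
      .agg(
        sum("amount").as("sum_amount"),
        collect_list(concat_ws("$", $"date", $"amount")).as("collect_val"))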

    scala> val (month1 : String,month2 : String) = ("jul","dec")
    month1: String = jul
    month2: String = dec
    
    scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase)))
    req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 2 more fields]
    
    scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase))).drop('collect_val)
    req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 1 more field]
    
    scala> req_df.orderBy('column1).show
    +-------+----------+----+
    |column1|sum_amount|diff|
    +-------+----------+----+
    |      A|      8000|2000|
    |      B|      1100| 200|
    +-------+----------+----+
    

    Hope this is what you want.
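
    As a side note, the same diff can be computed without a UDF by pivoting the two months of interest into columns. A minimal sketch reusing input_df from the spark-shell session above (the month parsing mirrors the UDF's split-and-trim logic; the "jul" and "dec" column names come from the pivot values):

    import org.apache.spark.sql.functions.{lower, split, trim}

    val pivot_df = input_df
      .withColumn("month", lower(trim(split($"date", "-").getItem(1)))) // "1- jul" -> "jul"
      .groupBy("column1")
      .pivot("month", Seq("jul", "dec")) // one column per month of interest
      .sum("amount")
      .withColumn("diff", $"dec" - $"jul") // e.g. 3000 - 1000 = 2000 for group A

    pivot_df.orderBy('column1).show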
