Split a row into two and dummy some columns


Question


I need to split a row and create a new row by changing the date columns and setting the Amt column to zero, as in the example below:

Input:  
+---+-----------------------+-----------------------+-----+
|KEY|START_DATE             |END_DATE               |Amt  |
+---+-----------------------+-----------------------+-----+
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|
|0  |2017-10-30T00:00:00.000|2017-11-02T23:59:59.000|67.5 |-> split this row: "2017-10-31T23:59:59" falls between its START_DATE and END_DATE
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |
|1  |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |-> split this row: "2017-10-31T23:59:59" falls between its START_DATE and END_DATE
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|
+---+-----------------------+-----------------------+-----+

If "2017-10-31T23:59:59" is in between row start_date and end_date , then split the row into two rows by changing the end_date for one row and start_date for another row. And make the amt to zero for the new row as below:

Desired Output:

+---+-----------------------+-----------------------+-----+---+
|KEY|START_DATE             |END_DATE               |Amt  |Ind|
+---+-----------------------+-----------------------+-----+---+
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|N  |
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|N  |

|0  |2017-10-30T00:00:00.000|2017-10-30T23:59:59.998|67.5 |N  |-> parent row (END_DATE changed)
|0  |2017-10-30T23:59:59.999|2017-11-02T23:59:59.000|0.0  |Y  |-> new split row (START_DATE changed, Amt = 0.0)

|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |N  |     
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |N  |

|1  |2017-10-30T00:00:00.000|2017-10-30T23:59:59.998|11.0 |N  |-> parent row (END_DATE changed)
|1  |2017-10-30T23:59:59.999|2017-11-01T23:59:59.000|0.0  |Y  |-> new split row (START_DATE changed, Amt = 0.0)

|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|N  |     
+---+-----------------------+-----------------------+-----+---+

I have tried the code below and am able to copy the row, but I am unable to update the rows on the fly.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._ // needed for groupByKey/flatMapGroups on the DataFrame

val df1Columns = Seq("KEY", "START_DATE", "END_DATE", "Amt")

  val df1Schema = new StructType(df1Columns.map(c => StructField(c, StringType, nullable = false)).toArray)
  val input1: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "100.0").toArray
  val row1: Row = Row.fromSeq(input1)
  val input2: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "200.0").toArray
  val row2: Row = Row.fromSeq(input2)
  val input3: Array[String] = Seq("0", "2017-10-30T00:00:00.000", "2017-11-02T23:59:59.000", "67.5").toArray
  val row3: Row = Row.fromSeq(input3)
  val input4: Array[String] = Seq("0", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "55.3").toArray
  val row4: Row = Row.fromSeq(input4)
  val input5: Array[String] = Seq("1", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "22.2").toArray
  val row5: Row = Row.fromSeq(input5)
  val input6: Array[String] = Seq("1", "2017-10-30T00:00:00.000", "2017-11-01T23:59:59.000", "11.0").toArray
  val row6: Row = Row.fromSeq(input6)
  val input7: Array[String] = Seq("1", "2016-12-14T23:59:59.000", "2017-10-29T23:59:58.000", "87.33").toArray
  val row7: Row = Row.fromSeq(input7)

  val rdd: RDD[Row] = spark.sparkContext.parallelize(Seq(row1, row2, row3, row4, row5, row6, row7))
  val df: DataFrame = spark.createDataFrame(rdd, df1Schema)

  //----------------------------------------------------------------

def encoder(columns: Seq[String]): Encoder[Row] = RowEncoder(StructType(columns.map(StructField(_, StringType, nullable = true))))
val outputColumns = Seq("KEY", "START_DATE", "END_DATE", "Amt","Ind")

  val result = df.groupByKey(r => r.getAs[String]("KEY"))
    .flatMapGroups((_, rowsForAkey) => {
      var result: List[Row] = List()
      for (row <- rowsForAkey) {
        val qrDate = "2017-10-31T23:59:59"
        val currRowStartDate = row.getAs[String]("START_DATE")
        val rowEndDate = row.getAs[String]("END_DATE")
        if (currRowStartDate <= qrDate && qrDate <= rowEndDate) //Quota
        {
          val rLayer = row
          result = result :+ rLayer
        }
        val originalRow = row
        result = result :+ originalRow
      }
      result
      })(encoder(df1Columns)).toDF

  df.show(false)
  result.show(false)

Here is my code output:

+---+-----------------------+-----------------------+-----+
|KEY|START_DATE             |END_DATE               |Amt  |
+---+-----------------------+-----------------------+-----+
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|100.0|     
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|200.0|     
|0  |2017-10-30T00:00:00.000|2017-11-02T23:59:59.000|67.5 |
|0  |2017-10-30T00:00:00.000|2017-11-02T23:59:59.000|67.5 |
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|55.3 |     
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|22.2 |     
|1  |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |
|1  |2017-10-30T00:00:00.000|2017-11-01T23:59:59.000|11.0 |
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:58.000|87.33|     
+---+-----------------------+-----------------------+-----+

Answer 1:


I would suggest you go with the built-in functions rather than such a complex RDD-based approach.

I have used built-in functions such as lit to populate constants, and a udf function to change the time portion of the date columns.

The main idea is to split the dataframe into two and finally union them (I have commented the code for clarity).

import org.apache.spark.sql.functions._
//udf function to change the time
def changeTimeInDate = udf((toCopy: String, withCopied: String)=> withCopied.split("T")(0)+"T"+toCopy.split("T")(1))

//creating Ind column with N populated and saving in temporaty dataframe
val indDF = df.withColumn("Ind", lit("N"))

//filtering out the rows that match the condition mentioned in the question and then changing the Amt column and Ind column and START_DATE
val duplicatedDF = indDF.filter($"START_DATE" <= "2017-10-31T23:59:59" && $"END_DATE" >= "2017-10-31T23:59:59")
  .withColumn("Amt", lit("0.0"))
  .withColumn("Ind", lit("Y"))
  .withColumn("START_DATE", changeTimeInDate($"END_DATE", $"START_DATE"))

//Changing the END_DATE and finally merging both
val result = indDF.withColumn("END_DATE", changeTimeInDate($"START_DATE", $"END_DATE"))
  .union(duplicatedDF)

This should give you the desired output:

+---+-----------------------+-----------------------+-----+---+
|KEY|START_DATE             |END_DATE               |Amt  |Ind|
+---+-----------------------+-----------------------+-----+---+
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|100.0|N  |
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|55.3 |N  |
|0  |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|200.0|N  |
|0  |2017-10-30T00:00:00.000|2017-11-01T00:00:00.000|67.5 |N  |
|0  |2017-10-30T23:59:59.000|2017-11-01T23:59:59.000|0.0  |Y  |
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|22.2 |N  |
|1  |2016-12-14T23:59:59.000|2017-10-29T23:59:59.000|87.33|N  |
|1  |2017-10-30T00:00:00.000|2017-11-01T00:00:00.000|11.0 |N  |
|1  |2017-10-30T23:59:59.000|2017-11-01T23:59:59.000|0.0  |Y  |
+---+-----------------------+-----------------------+-----+---+
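If you additionally want END_DATE left untouched for rows that are not split (the desired output above keeps 23:59:58 on those rows), a hedged variant using Spark's when/otherwise could look like the following sketch; it reuses the changeTimeInDate udf and the indDF/duplicatedDF dataframes defined above:

//Changing END_DATE only where the split condition holds, then merging both
val result = indDF
  .withColumn("END_DATE",
    when($"START_DATE" <= "2017-10-31T23:59:59" && $"END_DATE" >= "2017-10-31T23:59:59",
      changeTimeInDate($"START_DATE", $"END_DATE"))
      .otherwise($"END_DATE"))
  .union(duplicatedDF)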



Answer 2:


It looks like you're duplicating the rows, rather than altering them.

You can replace the inside of your flatMapGroups function with something like:

rowsForAkey.flatMap { row =>
  val qrDate = "2017-10-31T23:59:59"
  val currRowStartDate = row.getAs[String]("START_DATE")
  val rowEndDate = row.getAs[String]("END_DATE")
  if (currRowStartDate <= qrDate && qrDate <= rowEndDate) //Quota
  {
    val splitDate = endOfDay(currRowStartDate)
    // need to build two rows: the parent keeps its Amt, the new row gets Amt = 0.0
    val parentRow = Row(row(0), row(1), splitDate, row(3), "N")
    val splitRow = Row(row(0), splitDate, row(2), "0.0", "Y")
    List(parentRow, splitRow)
  }
  else {
    List(row)
  }
}

Basically, any time you have a for loop building up a list like this in Scala, it's really map or flatMap that you want. Here, it's flatMap since each row will give us either one or two elements in the result. I've assumed you introduce a function endOfDay to make the right timestamp.
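For completeness, a minimal sketch of such an endOfDay helper, assuming the timestamps stay as plain strings in the yyyy-MM-dd'T'HH:mm:ss.SSS format used above (it produces a single split timestamp, so it glosses over the .998/.999 distinction in the desired output):

// Hypothetical helper: keep the date part of the timestamp and append an
// end-of-day time in the same string format as the input data.
def endOfDay(timestamp: String): String =
  timestamp.split("T")(0) + "T23:59:59.998"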

I realize you may be reading data in a way that gives you a DataFrame, but I do want to offer the idea of using Dataset[Some Case Class] instead--it'd basically be a drop-in replacement (you're basically viewing your DataFrame as Dataset[Row], which is what it is, after all) and I think it would make things easier to read, plus you'd get type-checking.

Also as a heads up, if you import spark.implicits._, you shouldn't need the encoder--everything looks to be a string or a float and those encoders are available.
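As a rough sketch of that idea (the case class name is an assumption, and the fields are kept as String to mirror the schema in the question):

import org.apache.spark.sql.Dataset
import spark.implicits._ // supplies encoders for case classes, so no explicit RowEncoder is needed

// Hypothetical case class mirroring the input columns of the question's DataFrame
case class Entry(KEY: String, START_DATE: String, END_DATE: String, Amt: String)

val ds: Dataset[Entry] = df.as[Entry] // typed view of the same data

// field access is now checked at compile time, e.g.
val candidates = ds.filter(e => e.START_DATE <= "2017-10-31T23:59:59" && "2017-10-31T23:59:59" <= e.END_DATE)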



Source: https://stackoverflow.com/questions/49047368/split-a-row-into-two-and-dummy-some-columns
