Transposing table to given format in spark [duplicate]

Question


I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

val df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

df.show(200)

Given table:

+----------+----------+----------+------------+
|class_type|start_date|  end_date|       ratio|
+----------+----------+----------+------------+
|         A|2016-01-01|2016-12-01| 0.044999408|
|         A|2016-01-01|2016-12-01|0.0449999426|
|         A|2016-01-01|2016-12-01| 0.045999415|
|         B|2016-01-01|2016-12-01|0.0787888909|
|         B|2016-01-01|2016-12-01| 0.079779426|
|         B|2016-01-01|2016-12-01| 0.999989415|
|         C|2016-01-01|2016-12-01|0.0011999408|
|         C|2016-01-01|2016-12-01|0.0087999426|
|         C|2016-01-01|2016-12-01|0.0089899941|
+----------+----------+----------+------------+

Expected table format:

+----------+----------+------------+------------+------------+
|start_date|  end_date|           A|           B|           C|
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01| 0.044999408|0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426| 0.079779426|0.0087999426|
|2016-01-01|2016-12-01| 0.045999415| 0.999989415|0.0089899941|
+----------+----------+------------+------------+------------+

How can this be done?

I tried this:

val pivotDf = df.groupBy("start_date", "end_date", "class_type").pivot(col("class_type")).agg(first(col("ratio")))

This produces:

+----------+----------+----------+-----------+------------+------------+
|start_date|  end_date|class_type|          A|           B|           C|
+----------+----------+----------+-----------+------------+------------+
|2016-01-01|2016-12-01|         A|0.044999408|        null|        null|
|2016-01-01|2016-12-01|         B|       null|0.0787888909|        null|
|2016-01-01|2016-12-01|         C|       null|        null|0.0011999408|
+----------+----------+----------+-----------+------------+------------+

Answer 1:


In your sample data there is no key that relates a ratio row of one class_type to the corresponding rows of the other classes; the association exists only through row order.

If the rows are already in the desired order, you can assign a rank within each (start_date, end_date, class_type) group and pivot on that rank.

Here is an example using rank:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

// Rank rows by ratio within each (start_date, end_date, class_type) group
val byRatio = Window.partitionBy(col("start_date"), col("end_date"), col("class_type")).orderBy(col("ratio"))

var df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

df = df.withColumn("class_rank", rank().over(byRatio))

// Pivoting on class_type with class_rank in the group key aligns the i-th row of each class
var pivotDf = df.groupBy("start_date", "end_date", "class_rank").pivot("class_type").agg(max(col("ratio")))
pivotDf = pivotDf.drop(col("class_rank"))
pivotDf.show(10, false)

With your data, you get the output below:

+----------+----------+------------+------------+------------+
|start_date|end_date  |A           |B           |C           |
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01|0.044999408 |0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426|0.079779426 |0.0087999426|
|2016-01-01|2016-12-01|0.045999415 |0.999989415 |0.0089899941|
+----------+----------+------------+------------+------------+
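
One caveat, not raised in the original answer: rank assigns equal ranks to tied ratio values within a group, so exact duplicate ratios would collapse onto a single pivoted row through the max aggregation. A minimal variation (names like byRatioUnique and pivotDfUnique are placeholders; df is the DataFrame built above) is to use row_number, which always yields a unique, consecutive index per row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number() gives each row in the window a unique index, even when ratios tie
val byRatioUnique = Window.partitionBy(col("start_date"), col("end_date"), col("class_type")).orderBy(col("ratio"))

val pivotDfUnique = df
  .withColumn("class_rank", row_number().over(byRatioUnique))
  .groupBy("start_date", "end_date", "class_rank")
  .pivot("class_type")
  .agg(max(col("ratio")))
  .drop("class_rank")

With row_number, tied ratios within a class still land on separate output rows, whereas rank would merge them.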


Source: https://stackoverflow.com/questions/66167061/transposing-table-to-given-format-in-spark
