Transposing table to given format in spark [duplicate]

Question


I am using Spark v2.4.1 and have a scenario where I need to convert a table structured as below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

val df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

df.show(200)

Given table:

+----------+----------+----------+------------+
|class_type|start_date|  end_date|       ratio|
+----------+----------+----------+------------+
|         A|2016-01-01|2016-12-01| 0.044999408|
|         A|2016-01-01|2016-12-01|0.0449999426|
|         A|2016-01-01|2016-12-01| 0.045999415|
|         B|2016-01-01|2016-12-01|0.0787888909|
|         B|2016-01-01|2016-12-01| 0.079779426|
|         B|2016-01-01|2016-12-01| 0.999989415|
|         C|2016-01-01|2016-12-01|0.0011999408|
|         C|2016-01-01|2016-12-01|0.0087999426|
|         C|2016-01-01|2016-12-01|0.0089899941|
+----------+----------+----------+------------+

Expected table format:

+----------+----------+------------+------------+------------+
|start_date|  end_date|           A|           B|           C|
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01| 0.044999408|0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426| 0.079779426|0.0087999426|
|2016-01-01|2016-12-01| 0.045999415| 0.999989415|0.0089899941|
+----------+----------+------------+------------+------------+

How can this be done?

I tried this:

val pivotDf = df.groupBy("start_date", "end_date", "class_type").pivot(col("class_type")).agg(first(col("ratio")))

This produces:

+----------+----------+----------+-----------+------------+------------+
|start_date|  end_date|class_type|          A|           B|           C|
+----------+----------+----------+-----------+------------+------------+
|2016-01-01|2016-12-01|         A|0.044999408|        null|        null|
|2016-01-01|2016-12-01|         B|       null|0.0787888909|        null|
|2016-01-01|2016-12-01|         C|       null|        null|0.0011999408|
+----------+----------+----------+-----------+------------+------------+

Answer 1:


In your sample data there is no key that relates a ratio row of one class_type to the corresponding rows of the other classes; the association exists only through row order.

If the rows are already in the desired order, you can assign a rank within each (start_date, end_date, class_type) group and pivot on that rank.

Here is an example using rank:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, DoubleType}
import spark.implicits._

// Rank rows by ratio within each (start_date, end_date, class_type) group
val byRatio = Window.partitionBy(col("start_date"), col("end_date"), col("class_type")).orderBy(col("ratio"))

var df = Seq(
  ("A", "2016-01-01", "2016-12-01", "0.044999408"),
  ("A", "2016-01-01", "2016-12-01", "0.0449999426"),
  ("A", "2016-01-01", "2016-12-01", "0.045999415"),
  ("B", "2016-01-01", "2016-12-01", "0.0787888909"),
  ("B", "2016-01-01", "2016-12-01", "0.079779426"),
  ("B", "2016-01-01", "2016-12-01", "0.999989415"),
  ("C", "2016-01-01", "2016-12-01", "0.0011999408"),
  ("C", "2016-01-01", "2016-12-01", "0.0087999426"),
  ("C", "2016-01-01", "2016-12-01", "0.0089899941")
).toDF("class_type", "start_date", "end_date", "ratio")
  .withColumn("start_date", to_date($"start_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("end_date", to_date($"end_date", "yyyy-MM-dd").cast(DateType))
  .withColumn("ratio", col("ratio").cast(DoubleType))

df = df.withColumn("class_rank", rank().over(byRatio))

// Pivoting on class_type with class_rank in the group key aligns the i-th row of each class
var pivotDf = df.groupBy("start_date", "end_date", "class_rank").pivot("class_type").agg(max(col("ratio")))
pivotDf = pivotDf.drop(col("class_rank"))
pivotDf.show(10, false)

With your data, you get the output below:

+----------+----------+------------+------------+------------+
|start_date|end_date  |A           |B           |C           |
+----------+----------+------------+------------+------------+
|2016-01-01|2016-12-01|0.044999408 |0.0787888909|0.0011999408|
|2016-01-01|2016-12-01|0.0449999426|0.079779426 |0.0087999426|
|2016-01-01|2016-12-01|0.045999415 |0.999989415 |0.0089899941|
+----------+----------+------------+------------+------------+
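
One caveat, not raised in the original answer: rank assigns equal ranks to tied ratio values within a group, so exact duplicate ratios would collapse onto a single pivoted row through the max aggregation. A minimal variation (names like byRatioUnique and pivotDfUnique are placeholders; df is the DataFrame built above) is to use row_number, which always yields a unique, consecutive index per row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// row_number() gives each row in the window a unique index, even when ratios tie
val byRatioUnique = Window.partitionBy(col("start_date"), col("end_date"), col("class_type")).orderBy(col("ratio"))

val pivotDfUnique = df
  .withColumn("class_rank", row_number().over(byRatioUnique))
  .groupBy("start_date", "end_date", "class_rank")
  .pivot("class_type")
  .agg(max(col("ratio")))
  .drop("class_rank")

With row_number, tied ratios within a class still land on separate output rows, whereas rank would merge them.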


Source: https://stackoverflow.com/questions/66167061/transposing-table-to-given-format-in-spark
