问题
I want to transpose a dataframe. This is just a small excerpt from my original dataframe -
from pyspark.sql.functions import to_timestamp, date_format
valuesCol = [('22','ABC Ltd','U.K.','class 1',102),('22','ABC Ltd','U.K.','class 2',73),('22','ABC Ltd','U.K.','class 3',92),
('51','Eric AB','Sweden','class 1',52),('51','Eric AB','Sweden','class 2',34),('51','Eric AB','Sweden','class 3',11)]
df = sqlContext.createDataFrame(valuesCol,['ID','Firm','Country','Class','Revenue'])
df.show()
+---+-------+-------+-------+-------+
| ID| Firm|Country| Class|Revenue|
+---+-------+-------+-------+-------+
| 22|ABC Ltd| U.K.|class 1| 102|
| 22|ABC Ltd| U.K.|class 2| 73|
| 22|ABC Ltd| U.K.|class 3| 92|
| 51|Eric AB| Sweden|class 1| 52|
| 51|Eric AB| Sweden|class 2| 34|
| 51|Eric AB| Sweden|class 3| 11|
+---+-------+-------+-------+-------+
There is no transpose function in PySpark
as such. One way to achieve the requisite result is by creating 3 dataframes
on class1, class2 and class3
and then joining (left join
) them. But that could involve a reshuffle over the network, depending on hash partitioner, and is very costly. I am sure, there should be an elegant and a simple way.
Expected output:
+---+-------+-------+-------+-------+-------+
| ID| Firm|Country| Class1| Class2| Class3|
+---+-------+-------+-------+-------+-------+
| 22|ABC Ltd| U.K.| 102| 73| 92|
| 51|Eric AB| Sweden| 52| 34| 11|
+---+-------+-------+-------+-------+-------+
回答1:
Courtesy this link. We have to use an aggregate function while pivoting, as pivoting is always in context to aggregation. Aggregation function could be sum, count, mean, min or max, depending upon the output desired -
df = df.groupBy(["ID","Firm","Country"]).pivot("Class").sum("Revenue")
df.show()
+---+-------+-------+-------+-------+-------+
| ID| Firm|Country|class 1|class 2|class 3|
+---+-------+-------+-------+-------+-------+
| 51|Eric AB| Sweden| 52| 34| 11|
| 22|ABC Ltd| U.K.| 102| 73| 92|
+---+-------+-------+-------+-------+-------+
来源:https://stackoverflow.com/questions/53171023/pyspark-how-to-transpose-a-dataframe