PySpark - How to transpose a Dataframe [duplicate]

问题

I want to transpose a dataframe. This is just a small excerpt from my original dataframe -

from pyspark.sql.functions import to_timestamp, date_format 
valuesCol = [('22','ABC Ltd','U.K.','class 1',102),('22','ABC Ltd','U.K.','class 2',73),('22','ABC Ltd','U.K.','class 3',92),
             ('51','Eric AB','Sweden','class 1',52),('51','Eric AB','Sweden','class 2',34),('51','Eric AB','Sweden','class 3',11)]
df = sqlContext.createDataFrame(valuesCol,['ID','Firm','Country','Class','Revenue'])
df.show()
+---+-------+-------+-------+-------+
| ID|   Firm|Country|  Class|Revenue|
+---+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|class 1|    102|
| 22|ABC Ltd|   U.K.|class 2|     73|
| 22|ABC Ltd|   U.K.|class 3|     92|
| 51|Eric AB| Sweden|class 1|     52|
| 51|Eric AB| Sweden|class 2|     34|
| 51|Eric AB| Sweden|class 3|     11|
+---+-------+-------+-------+-------+

There is no transpose function in PySpark as such. One way to achieve the requisite result is by creating 3 dataframes on class1, class2 and class3 and then joining (left join) them. But that could involve a reshuffle over the network, depending on hash partitioner, and is very costly. I am sure, there should be an elegant and a simple way.

Expected output:

+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country| Class1| Class2| Class3|
+---+-------+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|    102|     73|     92|
| 51|Eric AB| Sweden|     52|     34|     11|
+---+-------+-------+-------+-------+-------+

回答1:

Courtesy this link. We have to use an aggregate function while pivoting, as pivoting is always in context to aggregation. Aggregation function could be sum, count, mean, min or max, depending upon the output desired -

df = df.groupBy(["ID","Firm","Country"]).pivot("Class").sum("Revenue")
df.show()
+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country|class 1|class 2|class 3|
+---+-------+-------+-------+-------+-------+
| 51|Eric AB| Sweden|     52|     34|     11|
| 22|ABC Ltd|   U.K.|    102|     73|     92|
+---+-------+-------+-------+-------+-------+

来源：https://stackoverflow.com/questions/53171023/pyspark-how-to-transpose-a-dataframe

标签

python

apache-spark

dataframe

pyspark

transpose