PySpark - How to transpose a Dataframe [duplicate]

Submitted by 自闭症网瘾萝莉.ら on 2019-12-02 16:23:05

Question


I want to transpose a dataframe. This is just a small excerpt from my original dataframe -

from pyspark.sql import SparkSession

# Reuse the active session, or start one when running outside the pyspark shell
spark = SparkSession.builder.getOrCreate()
valuesCol = [('22','ABC Ltd','U.K.','class 1',102),('22','ABC Ltd','U.K.','class 2',73),('22','ABC Ltd','U.K.','class 3',92),
             ('51','Eric AB','Sweden','class 1',52),('51','Eric AB','Sweden','class 2',34),('51','Eric AB','Sweden','class 3',11)]
df = spark.createDataFrame(valuesCol,['ID','Firm','Country','Class','Revenue'])
df.show()
+---+-------+-------+-------+-------+
| ID|   Firm|Country|  Class|Revenue|
+---+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|class 1|    102|
| 22|ABC Ltd|   U.K.|class 2|     73|
| 22|ABC Ltd|   U.K.|class 3|     92|
| 51|Eric AB| Sweden|class 1|     52|
| 51|Eric AB| Sweden|class 2|     34|
| 51|Eric AB| Sweden|class 3|     11|
+---+-------+-------+-------+-------+

PySpark has no built-in transpose function as such. One way to get the required result is to create three dataframes, one each for class 1, class 2 and class 3, and then left-join them. But that could trigger a reshuffle over the network, depending on the hash partitioner, and is very costly. I am sure there must be a simpler, more elegant way.

Expected output:

+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country| Class1| Class2| Class3|
+---+-------+-------+-------+-------+-------+
| 22|ABC Ltd|   U.K.|    102|     73|     92|
| 51|Eric AB| Sweden|     52|     34|     11|
+---+-------+-------+-------+-------+-------+

Answer 1:


Courtesy of this link. Pivoting always happens in the context of an aggregation, so we have to supply an aggregate function when pivoting. The aggregate function can be sum, count, mean, min or max, depending on the desired output -

df = df.groupBy(["ID","Firm","Country"]).pivot("Class").sum("Revenue")
df.show()
+---+-------+-------+-------+-------+-------+
| ID|   Firm|Country|class 1|class 2|class 3|
+---+-------+-------+-------+-------+-------+
| 51|Eric AB| Sweden|     52|     34|     11|
| 22|ABC Ltd|   U.K.|    102|     73|     92|
+---+-------+-------+-------+-------+-------+


Source: https://stackoverflow.com/questions/53171023/pyspark-how-to-transpose-a-dataframe
