PySpark groupBy DataFrame without aggregation or count

Submitted by 狂风中的少年 on 2021-02-10 12:18:09

Question


Is it possible to iterate through a PySpark groupBy DataFrame without applying an aggregation or count?

For example, in pandas:

for i, d in df2:  # df2 is a pandas GroupBy object: i is the group key, d the sub-DataFrame
    mycode ....

^^ if using pandas ^^
Is there a different way to iterate over a groupBy in PySpark, or do I have to use aggregation and count?
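
For reference, a minimal runnable sketch of the pandas pattern described above (the column name some_col and the sample data are hypothetical placeholders):

import pandas as pd

df = pd.DataFrame({"some_col": ["a", "a", "b"], "val": [1, 2, 3]})
df2 = df.groupby("some_col")  # a pandas GroupBy object

for i, d in df2:
    print(i)  # the group key, e.g. "a"
    print(d)  # the sub-DataFrame holding that group's rows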

Answer 1:


At best you can use .first or .last to get the respective values from each group in the groupBy, but not every row in the way you can in pandas.

For example:

from pyspark.sql import functions as f

# take the first value of col1 and col2 within each some_col group
df.groupBy(df['some_col']).agg(f.first(df['col1']), f.first(df['col2'])).show()

Since there is a fundamental difference between how data is handled in pandas and in Spark, not all functionality can be used in the same way.

There are a few workarounds to get what you want, for example:

for diamonds DataFrame:

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+

You can use:

import pyspark.sql.functions as f

# collect the distinct values of the grouping column to the driver
l = [x.cut for x in diamonds.select("cut").distinct().collect()]

def groups(df, i):
    # return the sub-DataFrame for one group: a simple filter on the key
    return df.filter(f.col("cut") == i)

# for grouping on multiple columns
def groups_multi(df, i):
    return df.filter((f.col("cut") == i) & (f.col("color") == 'E'))  # use | for OR

for i in l:
    groups(diamonds, i).show(2)

Output:

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|    cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
|  2| 0.21|Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  4| 0.29|Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
only showing top 2 rows

+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|  cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 12| 0.23|Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
+---+-----+-----+-----+-------+-----+-----+-----+----+----+----+

...

In the groups function you can decide what kind of grouping you want for the data. It is a simple filter condition, but it gets you all of the groups separately, as the sketch below also shows for the multi-column variant.
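
A minimal sketch of driving groups_multi the same way, assuming the same diamonds DataFrame and the list l collected above:

# iterate the multi-column variant just like groups; each call yields
# the rows where cut == i and color == 'E' (some groups may come back empty,
# since color is fixed to 'E' inside groups_multi)
for i in l:
    groups_multi(diamonds, i).show(2)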




Answer 2:


When we do a groupBy we end up with a RelationalGroupedDataset (exposed as GroupedData in PySpark), which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further.

When you try to call ordinary DataFrame methods on the grouped DataFrame, it throws an error:

AttributeError: 'GroupedData' object has no attribute 'show'
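
A minimal sketch, assuming a DataFrame df with a column some_col, showing the error and the aggregation that resolves it:

grouped = df.groupBy("some_col")  # a GroupedData object, not a DataFrame
# grouped.show()                  # AttributeError: 'GroupedData' object has no attribute 'show'
grouped.count().show()            # an aggregation returns a DataFrame, which can be shown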


Source: https://stackoverflow.com/questions/59622573/pyspark-groupby-dataframe-without-aggregation-or-count
