Question
I have a file with a column containing IDs. Usually, an ID appears only once, but occasionally it's associated with multiple records. I want to count how many times a given ID appears, and then split into two separate dfs so I can run different operations on each. One df should be where IDs appear only once, and one should be where IDs appear multiple times.
I was able to successfully count the number of instances an ID appeared by grouping on ID and joining the counts back onto the original df, like so:
newdf = df.join(df.groupBy('ID').count(), on='ID')
This works nicely, as I get an output like so:
ID     Thing  count
287099 Foo    3
287099 Bar    3
287099 Foobar 3
321244 Barbar 1
333032 Barfoo 2
333032 Foofoo 2
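A minimal reproducible version of this step looks like the following (a sketch only: it assumes an active SparkSession, and the toy values simply mirror the example output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the example output above
df = spark.createDataFrame(
    [(287099, 'Foo'), (287099, 'Bar'), (287099, 'Foobar'),
     (321244, 'Barbar'), (333032, 'Barfoo'), (333032, 'Foofoo')],
    ['ID', 'Thing'],
)

# Count occurrences per ID and join the counts back onto the original rows
newdf = df.join(df.groupBy('ID').count(), on='ID')
newdf.show()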
But now I want to split the df so that I have one df where count = 1 and one where count > 1. The following, and variations on it, didn't work, however (df2 here is the joined df from above):
singular = df2.filter(df2.count == 1)
I get a 'TypeError: condition should be string or Column' error instead. When I tried displaying the type of the column, it said the count column is an instance. How can I get PySpark to treat the count column the way I need it to?
Answer 1:
count is a method of DataFrame:
>>> df2.count
<bound method DataFrame.count of DataFrame[id: bigint, count: bigint]>
whereas filter needs a Column to operate on. Change it as below, using bracket syntax so the column named count is not shadowed by the count() method:
singular = df2.filter(df2['count'] == 1)
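Building on that, a short sketch of the full split into the two dataframes described in the question might look like this (assuming df2 is the joined dataframe with the count column; pyspark.sql.functions.col is another way to reference the column that avoids the clash with the count method):

from pyspark.sql import functions as F

singular = df2.filter(F.col('count') == 1)   # IDs that appear exactly once
multiple = df2.filter(F.col('count') > 1)    # IDs that appear more than once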
Source: https://stackoverflow.com/questions/45395093/filtering-on-number-of-times-a-value-appears-in-pyspark