Question
I have a file with a column containing IDs. Usually, an ID appears only once, but occasionally it's associated with multiple records. I want to count how many times a given ID appears, and then split into two separate dfs so I can run different operations on each. One df should be where IDs appear only once, and one should be where IDs appear multiple times.
I was able to successfully count the number of instances an ID appeared by grouping on ID and joining the counts back onto the original df, like so:
newdf = df.join(df.groupBy('ID').count(), on='ID')
This works nicely, as I get an output like so:
ID     Thing  count
287099 Foo    3
287099 Bar    3
287099 Foobar 3
321244 Barbar 1
333032 Barfoo 2
333032 Foofoo 2
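A minimal reproducible version of this step looks like the following (a sketch only: it assumes an active SparkSession, and the toy values simply mirror the example output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the example output above
df = spark.createDataFrame(
    [(287099, 'Foo'), (287099, 'Bar'), (287099, 'Foobar'),
     (321244, 'Barbar'), (333032, 'Barfoo'), (333032, 'Foofoo')],
    ['ID', 'Thing'],
)

# Count occurrences per ID and join the counts back onto the original rows
newdf = df.join(df.groupBy('ID').count(), on='ID')
newdf.show()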
But now I want to split the df so that I have one df where count = 1 and one where count > 1. The following, and variations on it, didn't work, however (df2 here is the joined df from above):
singular = df2.filter(df2.count == 1)
I get a 'TypeError: condition should be string or Column' error instead. When I tried displaying the type of the column, it said the count column is an instance. How can I get PySpark to treat the count column the way I need it to?
Answer 1:
count is a method of DataFrame:
>>> df2.count
<bound method DataFrame.count of DataFrame[id: bigint, count: bigint]>
whereas filter needs a Column to operate on. Change it as below, using bracket syntax so the column named count is not shadowed by the count() method:
singular = df2.filter(df2['count'] == 1)
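Building on that, a short sketch of the full split into the two dataframes described in the question might look like this (assuming df2 is the joined dataframe with the count column; pyspark.sql.functions.col is another way to reference the column that avoids the clash with the count method):

from pyspark.sql import functions as F

singular = df2.filter(F.col('count') == 1)   # IDs that appear exactly once
multiple = df2.filter(F.col('count') > 1)    # IDs that appear more than once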
Source: https://stackoverflow.com/questions/45395093/filtering-on-number-of-times-a-value-appears-in-pyspark