Show distinct column values in a PySpark dataframe: python

忘了有多久 2020-12-23 10:55

Please suggest a pyspark dataframe alternative for Pandas df['col'].unique().

I want to list out all the unique values in a pyspark dataframe column.

9 Answers
  •  旧时难觅i
    2020-12-23 11:35

    In addition to dropDuplicates, there is a method with the name we know from pandas: drop_duplicates.

    drop_duplicates() is an alias for dropDuplicates().

    Example

    s_df = sqlContext.createDataFrame([("foo", 1),
                                       ("foo", 1),
                                       ("bar", 2),
                                       ("foo", 3)], ('k', 'v'))
    s_df.show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+
    

    Drop by subset

    s_df.drop_duplicates(subset=['k']).show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  1|
    +---+---+
    s_df.drop_duplicates().show()
    
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  3|
    |foo|  1|
    +---+---+
    
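
    For the exact analogue of pandas df['col'].unique() asked in the question, a common approach is to project the single column and call distinct(); a minimal sketch (using a local SparkSession rather than the older sqlContext shown above):

    ```python
    from pyspark.sql import SparkSession

    # Local session just for this sketch
    spark = SparkSession.builder.master("local[1]").appName("unique-demo").getOrCreate()

    s_df = spark.createDataFrame([("foo", 1),
                                  ("foo", 1),
                                  ("bar", 2),
                                  ("foo", 3)], ("k", "v"))

    # select() keeps only column k; distinct() drops repeated values;
    # collect() pulls the rows back to the driver as a Python list
    unique_values = [row.k for row in s_df.select("k").distinct().collect()]
    ```

    Note that distinct() runs as a distributed job, so collect() the result only when the number of unique values is small enough to fit on the driver.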
