Show distinct column values in a PySpark dataframe: python

忘了有多久 2020-12-23 10:55

Please suggest a pyspark dataframe alternative for Pandas df['col'].unique().

I want to list out all the unique values in a pyspark dataframe column.

9 Answers
  •  旧时难觅i
    2020-12-23 11:35

    In addition to dropDuplicates, there is a method with the name we know from pandas: drop_duplicates.

    drop_duplicates() is an alias for dropDuplicates().

    Example

    s_df = sqlContext.createDataFrame([("foo", 1),
                                       ("foo", 1),
                                       ("bar", 2),
                                       ("foo", 3)], ('k', 'v'))
    s_df.show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+
    

    Drop by subset

    s_df.drop_duplicates(subset=['k']).show()
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  1|
    +---+---+
    s_df.drop_duplicates().show()
    
    
    +---+---+
    |  k|  v|
    +---+---+
    |bar|  2|
    |foo|  3|
    |foo|  1|
    +---+---+
    
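
    For the exact analogue of pandas df['col'].unique() asked in the question, a common approach is to project the single column and call distinct(); a minimal sketch (using a local SparkSession rather than the older sqlContext shown above):

    ```python
    from pyspark.sql import SparkSession

    # Local session just for this sketch
    spark = SparkSession.builder.master("local[1]").appName("unique-demo").getOrCreate()

    s_df = spark.createDataFrame([("foo", 1),
                                  ("foo", 1),
                                  ("bar", 2),
                                  ("foo", 3)], ("k", "v"))

    # select() keeps only column k; distinct() drops repeated values;
    # collect() pulls the rows back to the driver as a Python list
    unique_values = [row.k for row in s_df.select("k").distinct().collect()]
    ```

    Note that distinct() runs as a distributed job, so collect() the result only when the number of unique values is small enough to fit on the driver.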
