Please suggest pyspark dataframe alternative for Pandas df[\'col\'].unique()
.
I want to list out all the unique values in a pyspark dataframe column.
If you want to see the distinct values of a specific column in your dataframe , you would just need to write -
df.select('colname').distinct().show(100,False)
This would show the 100 distinct values (if 100 values are available) for the colname column in the df dataframe.
If you want to do something fancy on the distinct values, you can save the distinct values in a vector
a = df.select('colname').distinct()
Here, a would have all the distinct values of the column colname