show distinct column values in pyspark dataframe: python

问题

Please suggest pyspark dataframe alternative for Pandas df['col'].unique().

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL type way (registertemplate then SQL query for distinct values).

Also I don't need groupby->countDistinct, instead I want to check distinct VALUES in that column.

回答1:

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two unique:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]

回答2:

This should help to get distinct values of a column:

df.select('column1').distinct().show()

回答3:

You can use df.dropDuplicates(['col1','col2']) to get only distinct rows based on colX in the array.

回答4:

collect_set can help to get unique values from a given column of pyspark.sql.DataFrame df.select(F.collect_set("column").alias("column")).first()["column"]

来源：https://stackoverflow.com/questions/39383557/show-distinct-column-values-in-pyspark-dataframe-python

标签

pyspark

pyspark-sql