check number of unique values in each column of a matrix in spark

假装没事ソ 提交于 2019-12-06 03:30:26

You can use the countDistinct function on each column.

For example, in pyspark:

df = spark.createDataFrame([ (1, 1), (1, 3), (2, 1), (3, 2), (3, 3) ], ["user_id", "genre_id"])
>>> df.show()
+-------+--------+
|user_id|genre_id|
+-------+--------+
|      1|       1|
|      1|       3|
|      2|       1|
|      3|       2|
|      3|       3|
+-------+--------+

>>> import pyspark.sql.functions as F
>>> df.select( [ F.countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
+---------+----------+
|c_user_id|c_genre_id|
+---------+----------+
|        3|         3|
+---------+----------+
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!