pyspark; check if an element is in collect_list [duplicate]

試著忘記壹切 · Submitted on 2019-12-21 16:55:11

Question


I am working with a DataFrame df, for instance the following one:

df.show()

Output:

+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+

I use collect_set to aggregate and obtain a set of objects with duplicate elements eliminated (or collect_list to keep the duplicates):

from pyspark.sql.functions import collect_set

df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is then as follows:

df_new.show()

Output:

+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+

I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set of objects (in the column collectedSet_values). I do not want to go with a udf solution.

Please comment your solutions/ideas.

Kind Regards.


Answer 1:


There is actually a built-in function, array_contains, that does exactly this. It works on any array column, however it was produced. To check whether the word 'chair' exists in each set of objects, we can simply do the following:

from pyspark.sql.functions import array_contains

df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show(truncate=False)

Output:

+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+

The same applies to the result of collect_list.



Source: https://stackoverflow.com/questions/51499460/pyspark-check-if-an-element-is-in-collect-list
