Pyspark Dataframe get unique elements from column with string as list of elements

Submitted by 我的未来我决定 on 2021-02-19 07:34:05

Question


I have a dataframe (created by loading from multiple blobs in Azure) with a column whose values are strings representing lists of IDs. Now, I want a list of the unique IDs from this entire column:

Here is an example -

df - 
| col1 | col2 | col3  |
| "a"  | "b"  |"[q,r]"|
| "c"  | "f"  |"[s,r]"|

Here is my expected response:

resp = [q, r, s]

Any idea how to get there?

My current approach is to convert the strings in col3 to Python lists and then somehow flatten them out.

But so far I have not been able to do so. I tried using user-defined functions in PySpark, but they only returned strings, not lists.

flatMap only works on RDDs, not on DataFrames, so it is out of the picture.

Maybe there is a way to specify this during the conversion from RDD to DataFrame, but I am not sure how to do that.
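For reference, the string-to-list conversion and flattening the question describes can be sketched in plain Python, outside Spark (the `parse_ids` helper and sample values are illustrative, not from the original post):

```python
rows = ['[q,r]', '[s,r]']  # the col3 values from the example

def parse_ids(s):
    # strip the surrounding brackets, then split on commas
    return s.strip('[]').split(',')

# flatten and deduplicate, preserving first-seen order
seen = []
for s in rows:
    for v in parse_ids(s):
        if v not in seen:
            seen.append(v)

print(seen)  # ['q', 'r', 's']
```

This is the logic a PySpark solution has to reproduce in a distributed way.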


Answer 1:


Here is a method using only DataFrame functions:

from pyspark.sql import functions as f

df = spark.createDataFrame([('a','b','[q,r,p]'),('c','f','[s,r]')],['col1','col2','col3'])

df = df.withColumn('col4', f.split(f.regexp_extract('col3', r'\[(.*)\]', 1), ','))

df.select(f.explode('col4').alias('exploded')).groupby('exploded').count().show()
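The `regexp_extract` pattern above pulls out everything between the brackets, and `split` then breaks it on commas. That part of the behaviour can be checked without a Spark session using Python's own `re` module (the `extract_ids` helper is an illustration, not part of the original answer):

```python
import re

def extract_ids(s):
    # mirrors f.regexp_extract('col3', r'\[(.*)\]', 1) followed by f.split(..., ',')
    m = re.search(r'\[(.*)\]', s)
    return m.group(1).split(',') if m else []

print(extract_ids('[q,r,p]'))  # ['q', 'r', 'p']
print(extract_ids('[s,r]'))    # ['s', 'r']
```

In Spark, `explode` then turns each of those array elements into its own row, and the `groupby(...).count()` shows each distinct ID once.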



Answer 2:


We can use a UDF along with collect_list. I tried it my way:

>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from functools import reduce

>>> df = spark.createDataFrame([('a','b','[q,r]'),('c','f','[s,r]')],['col1','col2','col3'])
>>> df.show()
+----+----+-----+
|col1|col2| col3|
+----+----+-----+
|   a|   b|[q,r]|
|   c|   f|[s,r]|
+----+----+-----+

>>> udf1 = F.udf(lambda x : [v for v in reduce(lambda x,y : set(x+y),x) if v not in ['[',']',',']],ArrayType(StringType()))
## Each col3 value is the string form of a list. We concatenate the strings and build a set over them, which removes duplicates.
## Building a set from a string yields its characters, so '[' , ']' and ',' also appear as values; we filter those out.

>>> df.select(udf1(F.collect_list('col3')).alias('col3')).first().col3
['q', 'r', 's']

Not sure about performance. Hope this helps!
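The reduce trick above works character by character, so it only suits single-character IDs, and `set(x+y)` assumes exactly two collected strings (a set cannot be concatenated with `+` on later iterations). A plain-Python sketch of the same idea that handles any number of rows uses set union instead (the variable names here are illustrative):

```python
from functools import reduce

strings = ['[q,r]', '[s,r]']  # what collect_list('col3') would gather

# union of the character sets of all strings; works for 2+ elements,
# unlike set(x+y), which fails once the accumulator is already a set
chars = reduce(lambda x, y: set(x) | set(y), strings, set())

# drop the bracket and comma characters, keep the IDs
ids = sorted(v for v in chars if v not in ['[', ']', ','])
print(ids)  # ['q', 'r', 's']
```

For multi-character IDs, the split-and-explode approach from Answer 1 is the safer route.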



Source: https://stackoverflow.com/questions/47793412/pyspark-dataframe-get-unique-elements-from-column-with-string-as-list-of-element
