Question:
The current PySpark DataFrame has this structure (a list of WrappedArrays for col2):
+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b  |[WrappedArray(code5), WrappedArray(code6, code8)]|
+---+-------------------------------------------------+
This is the structure I would like to have (a flattened list for col2):
+---+---------------------+
|id |col2                 |
+---+---------------------+
|a  |[code2, code1, code3]|
|b  |[code5, code6, code8]|
+---+---------------------+
but I'm not sure how to do that transformation. I tried a flatMap, but that didn't seem to work. Any suggestions?
Answer 1:
You can do this in two ways: through the RDD API or with a UDF. Here is an example:
df = sqlContext.createDataFrame([
    ['a', [['code2'], ['code1', 'code3']]],
    ['b', [['code5', 'code6'], ['code8']]]
], ["id", "col2"])
df.show(truncate=False)
+---+-------------------------------------------------+
|id |col2 |
+---+-------------------------------------------------+
|a |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b |[WrappedArray(code5, code6), WrappedArray(code8)]|
+---+-------------------------------------------------+
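For reference, printing the schema confirms that col2 is an array of arrays of strings, which is why each inner list renders as a WrappedArray in the output above (this check is not part of the original answer):
# col2 is typed array<array<string>>; flattening means merging the inner arrays
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- col2: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: string (containsNull = true)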
RDD:
from functools import reduce  # needed on Python 3

# On Spark 2.x+, DataFrame has no .map(); go through .rdd first.
# reduce concatenates the inner lists of each row into one flat list.
df.rdd.map(lambda row: (row[0], reduce(lambda x, y: x + y, row[1]))).toDF().show(truncate=False)
+---+---------------------+
|_1 |_2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
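To keep the original column names instead of the default _1/_2, you can pass them to toDF (a small variation on the line above, assuming the same df):
df.rdd.map(lambda row: (row[0], reduce(lambda x, y: x + y, row[1]))) \
  .toDF(["id", "col2"]).show(truncate=False)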
UDF:
from functools import reduce  # needed on Python 3
from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    # concatenate the inner arrays into a single flat list;
    # equivalent loop-based version:
    #   emlist = []
    #   for item in val:
    #       emlist += item
    #   return emlist
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.StringType()))
df.select("id", flattenUdf("col2").alias("col2")).show(truncate=False)
+---+---------------------+
|id |col2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
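As a side note, on Spark 2.4 or later the built-in flatten function does the same thing without a Python UDF or an RDD round-trip (a minimal sketch, not part of the original answer):
from pyspark.sql import functions as F

# flatten() (Spark 2.4+) collapses an array of arrays into a single array
# natively in the JVM, avoiding Python serialization overhead
df.select("id", F.flatten("col2").alias("col2")).show(truncate=False)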
Source: https://stackoverflow.com/questions/46289068/pyspark-merge-wrappedarrays-within-a-dataframe