I have this data frame:
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store", "values"])
+-----+---------+
|store|   values|
Since Spark 2.4 you can use the built-in flatten function (pyspark.sql.functions.flatten), which makes this much easier. You just have to flatten the collected array after the groupBy.
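# 1. Create the DF (assumes an active SparkContext `sc`, e.g. in the pyspark shell)
from pyspark.sql import functions as F
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store","values"])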
+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+
# 2. Group by store
df = df.groupBy("store").agg(F.collect_list("values"))
+-----+--------------------+
|store|collect_list(values)|
+-----+--------------------+
|    1|[[1, 2, 3], [4, 5...|
|    2|          [[2], [3]]|
+-----+--------------------+
# 3. Finally, flatten the array
df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))
+-----+--------------------+------------------+
|store|collect_list(values)|     flatten_array|
+-----+--------------------+------------------+
|    1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
|    2|          [[2], [3]]|            [2, 3]|
+-----+--------------------+------------------+
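For intuition about what steps 2 and 3 compute, here is a minimal plain-Python sketch that does not use Spark at all. The helper name collect_and_flatten is hypothetical and only mimics the semantics: gather the arrays per store (like collect_list), then concatenate them, removing exactly one level of nesting (like flatten).

```python
from collections import defaultdict
from itertools import chain

# Same rows as the example DataFrame: (store, values)
rows = [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]

def collect_and_flatten(rows):
    """Hypothetical helper: group by store, collect the arrays, flatten one level."""
    collected = defaultdict(list)
    for store, values in rows:
        # Comparable to collect_list("values"): one list of arrays per store
        collected[store].append(values)
    # Comparable to flatten: concatenate the inner arrays of each store
    return {
        store: list(chain.from_iterable(arrays))
        for store, arrays in collected.items()
    }

result = collect_and_flatten(rows)
print(result)  # {1: [1, 2, 3, 4, 5, 6], 2: [2, 3]}
```

One difference to keep in mind: this sketch preserves input order, whereas Spark does not guarantee the order in which collect_list gathers rows after a shuffle. In PySpark itself, the two steps can probably be combined into a single aggregation, df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values")), which also avoids the awkward collect_list(values) column name.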