Using collect_set after exploding on a grouped DataFrame in PySpark
Question: I have a DataFrame whose schema looks like this:

```
root
 |-- docId: string (nullable = true)
 |-- field_a: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- field_b: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

I want to group by field_a and, in the aggregation, use collect_set to keep all the distinct values inside the field_b arrays (i.e. the inner elements, not the arrays themselves). I don't want to add a new column by exploding field_b and then running collect_set on it.
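A minimal sketch of one way to do this without explode, assuming Spark 2.4+ so that `flatten` and `array_distinct` are available; the sample rows and the `field_b_set` alias are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collect-set-demo").getOrCreate()

# Hypothetical sample data matching the schema in the question.
df = spark.createDataFrame(
    [
        ("d1", ["x"], ["a", "b"]),
        ("d2", ["x"], ["b", "c"]),
        ("d3", ["y"], ["a"]),
    ],
    "docId string, field_a array<string>, field_b array<string>",
)

# Collect the field_b arrays per group, flatten them into a single
# array, and de-duplicate the inner elements -- no explode needed.
no_explode = df.groupBy("field_a").agg(
    F.array_distinct(F.flatten(F.collect_list("field_b"))).alias("field_b_set")
)
no_explode.show(truncate=False)

# For comparison, the explode-based version the question wants to avoid:
# each inner element becomes its own row before collect_set aggregates.
with_explode = (
    df.select("field_a", F.explode("field_b").alias("b"))
      .groupBy("field_a")
      .agg(F.collect_set("b").alias("field_b_set"))
)
with_explode.show(truncate=False)
```

The `flatten`/`array_distinct` route keeps the data at the array level throughout, so no intermediate exploded rows are materialized between the scan and the aggregation.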