Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to flatten one generically.
This question might be a bit old, but for anyone still looking for a solution: you can flatten a struct column inline by selecting it with .* in select.
First, let's create the nested DataFrame:
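from pyspark.sql import HiveContext

# sc is the SparkContext that the pyspark shell provides.
# HiveContext is deprecated since Spark 2.0; SparkSession is its replacement.
hc = HiveContext(sc)

# Despite its name, "nested_array" is inferred as a struct, not an array.
nested_df = hc.read.json(sc.parallelize(["""
{
    "field1": 1,
    "field2": 2,
    "nested_array": {
        "nested_field1": 3,
        "nested_field2": 4
    }
}
"""]))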
Now, to flatten it:
flat_df = nested_df.select("field1", "field2", "nested_array.*")
You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html
If you have too many struct columns to name one by one, you can pick them out by type instead (this expands one level of nesting):
flat_cols = [name for name, dtype in nested_df.dtypes if not dtype.startswith('struct')]
nested_cols = [name for name, dtype in nested_df.dtypes if dtype.startswith('struct')]
flat_df = nested_df.select(*flat_cols, *[name + ".*" for name in nested_cols])
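The type-based selection above expands only one level of structs, while the question asks about arbitrary nesting. Here is a minimal sketch of one way to go deeper. It walks the schema dict returned by df.schema.jsonValue() and collects a dotted path for every non-struct field. The helper names leaf_paths and flatten are hypothetical, not part of PySpark. The sketch assumes that column names contain no literal dots and that joining path segments with underscores does not produce duplicate names. Arrays and maps are left untouched.

```python
def leaf_paths(schema_json, prefix=""):
    """Return dotted paths to every non-struct field in a schema dict.

    schema_json is the dict produced by df.schema.jsonValue().
    """
    paths = []
    for field in schema_json["fields"]:
        name = prefix + field["name"]
        ftype = field["type"]
        # Simple types are plain strings; complex types are dicts with a "type" key.
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Recurse into the struct, extending the dotted prefix.
            paths.extend(leaf_paths(ftype, name + "."))
        else:
            # Arrays, maps, and simple types are kept as a single column.
            paths.append(name)
    return paths


def flatten(df):
    """Select every leaf field, aliased as parent_child (hypothetical helper)."""
    # Imported here so leaf_paths can be used without Spark installed.
    from pyspark.sql.functions import col

    paths = leaf_paths(df.schema.jsonValue())
    return df.select([col(p).alias(p.replace(".", "_")) for p in paths])
```

With the example above, flatten(nested_df) should give the columns field1, field2, nested_array_nested_field1, and nested_array_nested_field2. Flattening arrays as well would need something like explode, which changes the row count, so it is left out of this sketch.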