Flatten Nested Spark Dataframe

-上瘾入骨i 2020-12-03 08:28

Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to flatten a Dataframe with any schema generically.

4 Answers
  •  暖寄归人
    2020-12-03 09:02

    This issue might be a bit old, but for anyone out there still looking for a solution: you can flatten struct columns inline using select with the "column.*" syntax.

    First, let's create the nested dataframe:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    nested_df = spark.read.json(sc.parallelize(["""
    {
      "field1": 1, 
      "field2": 2, 
      "nested_array":{
         "nested_field1": 3,
         "nested_field2": 4
      }
    }
    """]))
    

    Now, to flatten it:

    flat_df = nested_df.select("field1", "field2", "nested_array.*")
    

    You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html

    If you have many struct columns, you can flatten them all at once:

    # Split columns into non-struct and struct-typed, then expand each struct with ".*"
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
    
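    Note that the select-based approaches above expand only one level of nesting per select. For arbitrarily deep structs, a small recursive helper can repeat the expansion until no struct columns remain. The sketch below is my own addition (the `flatten` function and the `parent_child` naming scheme are assumptions, not part of the answer above); it assumes a running SparkSession:

    ```python
    from pyspark.sql.types import StructType

    def flatten(df):
        """Repeatedly expand struct columns until none remain.

        Expanded fields are aliased as parent_child to avoid
        name collisions between sibling structs.
        """
        while True:
            struct_cols = [f.name for f in df.schema.fields
                           if isinstance(f.dataType, StructType)]
            if not struct_cols:
                return df
            flat = [c for c in df.columns if c not in struct_cols]
            expanded = [df[s][f].alias(s + "_" + f)
                        for s in struct_cols
                        for f in df.schema[s].dataType.fieldNames()]
            df = df.select(flat + expanded)
    ```

    With the example dataframe above, `flatten(nested_df)` would yield the columns `field1`, `field2`, `nested_array_nested_field1`, `nested_array_nested_field2`. Array columns are left untouched; those need `explode` rather than `select("col.*")`.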
