Flatten Nested Spark Dataframe

Asked by -上瘾入骨i, 2020-12-03 08:28

Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with arbitrary nesting.

4 Answers
  • 2020-12-03 09:02

    This question might be a bit old, but for anyone out there still looking for a solution, you can flatten complex data types inline using select *:

    First, let's create the nested dataframe:

    # HiveContext was the pre-2.0 entry point; on Spark 2.x+ you would use
    # the SparkSession (spark.read.json) instead.
    from pyspark.sql import HiveContext
    hc = HiveContext(sc)
    nested_df = hc.read.json(sc.parallelize(["""
    {
      "field1": 1, 
      "field2": 2, 
      "nested_array":{
         "nested_field1": 3,
         "nested_field2": 4
      }
    }
    """]))
    

    Now, to flatten it:

    flat_df = nested_df.select("field1", "field2", "nested_array.*")
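    Despite its name, nested_array is inferred as a struct (a JSON object, not an array), which is what the .* expansion relies on. For this example the struct's children are promoted to top-level columns, roughly:

    flat_df.columns
    # ['field1', 'field2', 'nested_field1', 'nested_field2']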
    

    You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html

    If there are too many nested struct columns to list by hand, you can select them programmatically (a sketch that repeats this for deeper nesting follows the snippet):

    # Keep non-struct columns as-is and expand every struct column with ".*"
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
    
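    The two-list trick above expands only one level of nesting. A minimal sketch of repeating it until no struct columns remain (assuming struct-only nesting, i.e. no arrays or maps, plain field names, and no name collisions; the underscore-joined aliases and the fully_flat_df name are just one possible choice):

    from pyspark.sql.functions import col

    def flatten_structs(df):
        # Repeat the one-level expansion until no struct columns are left.
        while True:
            struct_cols = [name for name, dtype in df.dtypes if dtype.startswith('struct')]
            if not struct_cols:
                return df
            flat_cols = [name for name, dtype in df.dtypes if not dtype.startswith('struct')]
            # Expand each struct and prefix its children with the parent name
            # to avoid column-name collisions.
            exploded = [col(parent + '.' + child).alias(parent + '_' + child)
                        for parent in struct_cols
                        for child in df.select(parent + '.*').columns]
            df = df.select([col(c) for c in flat_cols] + exploded)

    fully_flat_df = flatten_structs(nested_df)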
  • 2020-12-03 09:10

    I've developed a recursive approach to flatten any nested DataFrame.

    The implementation is in the AWS Data Wrangler code base on GitHub.

    P.S. Spark support has since been deprecated in the package, but the code base is still useful.

  • 2020-12-03 09:11

    The following gist will flatten the structure of the nested JSON:

    import typing as T
    
    import cytoolz.curried as tz
    import pyspark
    
    
    def schema_to_columns(schema: pyspark.sql.types.StructType) -> T.List[T.List[str]]:
        """
        Produce a flat list of column specs from a possibly nested DataFrame schema
        """
    
        columns = list()
    
        def helper(schm: pyspark.sql.types.StructType, prefix: list = None):
    
            if prefix is None:
                prefix = list()
    
            for item in schm.fields:
                if isinstance(item.dataType, pyspark.sql.types.StructType):
                    helper(item.dataType, prefix + [item.name])
                else:
                    columns.append(prefix + [item.name])
    
        helper(schema)
    
        return columns
    
    def flatten_frame(frame: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    
        aliased_columns = list()
    
        for col_spec in schema_to_columns(frame.schema):
            c = tz.get_in(col_spec, frame)
            if len(col_spec) == 1:
                aliased_columns.append(c)
            else:
                aliased_columns.append(c.alias(':'.join(col_spec)))
    
        return frame.select(aliased_columns)
    

    You can then flatten the nested data as:

    flatten_data = flatten_frame(nested_df)

    This will give you the flattened dataframe.
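    Applied to a frame like the nested_df from the first answer, the nested fields come out aliased with their path joined by ':' (see the ':'.join in flatten_frame), roughly:

    flatten_frame(nested_df).columns
    # ['field1', 'field2', 'nested_array:nested_field1', 'nested_array:nested_field2']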

    The gist was taken from https://gist.github.com/DGrady/b7e7ff3a80d7ee16b168eb84603f5599

  • 2020-12-03 09:19

    Here's my final approach:

    1) Map the rows of the dataframe to an RDD of dicts, and flatten each dict with a generic Python dict-flattening function (a sketch of one possible helper follows step 2).

    # Use .rdd to map over the Row objects (DataFrame.map was removed in Spark 2.x)
    flat_rdd = nested_df.rdd.map(lambda x: flatten(x))
    

    where

    def flatten(x):
      x_dict = x.asDict()
      ...some flattening code...
      return x_dict
    

    2) Convert the RDD[dict] back to a dataframe

    flat_df = sqlContext.createDataFrame(flat_rdd)
    
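    The flatten helper above is only sketched. A minimal, hypothetical implementation (collapsing nested dicts into underscore-joined keys, using Row.asDict(recursive=True) so nested Rows become plain dicts) might look like:

    def flatten(x):
        # Convert the Row and any nested Rows into plain dicts.
        x_dict = x.asDict(recursive=True)

        def _flatten(d, prefix=''):
            out = {}
            for key, value in d.items():
                full_key = prefix + key
                if isinstance(value, dict):
                    # Recurse, joining parent and child names with '_'
                    # so the resulting column names stay easy to select.
                    out.update(_flatten(value, full_key + '_'))
                else:
                    out[full_key] = value
            return out

        return _flatten(x_dict)

    flat_rdd = nested_df.rdd.map(flatten)
    flat_df = sqlContext.createDataFrame(flat_rdd)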