Flatten Nested Spark Dataframe

Question

Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a DataFrame with different nested types (e.g. StructType, ArrayType, MapType, etc.).

Say I have a schema like:

StructType(List(StructField(field1,...), StructField(field2,...), StructField(nested_array,ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)))),...)))

Looking to adapt this into a flat table with a structure like:

field1
field2
nested_array.nested_field1
nested_array.nested_field2

FYI, I'm looking for suggestions for PySpark, but other flavors of Spark are also appreciated.

Answer 1:

This question is a bit old, but for anyone still looking for a solution: you can flatten struct columns inline by selecting struct_column.* in a select.

First, let's create the nested DataFrame:

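from pyspark.sql import SparkSession

# Reuse the active session or start a new one (SparkSession supersedes HiveContext)
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# One-row DataFrame built from a JSON string.
# Note: despite its name, nested_array is a struct here, not an array.
nested_df = spark.read.json(sc.parallelize(["""
{
  "field1": 1,
  "field2": 2,
  "nested_array": {
     "nested_field1": 3,
     "nested_field2": 4
  }
}
"""]))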

Now flatten it:

flat_df = nested_df.select("field1", "field2", "nested_array.*")

You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html

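If there are too many struct columns to list by hand, you can pick them out from the dtypes (this expands one level of nesting and does not handle arrays or maps):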

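# df.dtypes yields (column_name, type_string) pairs, e.g. ('nested_array', 'struct<...>')
flat_cols = [name for name, dtype in nested_df.dtypes if not dtype.startswith('struct')]
nested_cols = [name for name, dtype in nested_df.dtypes if dtype.startswith('struct')]
# Keep the flat columns and expand each struct column by one level
flat_df = nested_df.select(*flat_cols, *[name + ".*" for name in nested_cols])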

Answer 2:

I've developed a recursive approach to flatten any nested DataFrame. The implementation is in the AWS Data Wrangler project:

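import awswrangler

# Wrap an existing SparkSession (assumed to be available as spark)
session = awswrangler.Session(spark_session=spark)
# The result is iterated as a dict of name -> flattened DataFrame
dfs = session.spark.flatten(dataframe=df_nested)
for name, df_flat in dfs.items():
    print(name)
    df_flat.show()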

Or check the sources to see the raw implementation.

Answer 3:

Here's my final approach:

1) Map the rows in the DataFrame to an RDD of dicts. Find suitable Python code online for flattening a dict.

flat_rdd = nested_df.rdd.map(flatten)

where

def flatten(x):
    # recursive=True also converts nested Rows into dicts
    x_dict = x.asDict(recursive=True)
    ...some flattening code...
    return x_dict

2) Convert the RDD[dict] back to a dataframe

flat_df = sqlContext.createDataFrame(flat_rdd)
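
The flattening step itself is left open above. One possible helper, as a minimal sketch: it assumes each row was converted with asDict(recursive=True) so that nested Rows arrive as plain dicts. The name flatten_dict and the underscore separator are illustrative choices, not part of the original answer, and list values are left untouched.

```python
from typing import Any, Dict


def flatten_dict(d: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dicts into one level, joining keys with sep."""
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        # Prefix the key with the path of its parents, if any
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts and merge their flattened keys
            flat.update(flatten_dict(value, new_key, sep))
        else:
            # Scalars (and lists, which this sketch does not expand) are kept as-is
            flat[new_key] = value
    return flat


# Hypothetical row content matching the example data in Answer 1
row_dict = {"field1": 1, "field2": 2,
            "nested_array": {"nested_field1": 3, "nested_field2": 4}}
print(flatten_dict(row_dict))
```

Inside flatten, this could be used as: return flatten_dict(x.asDict(recursive=True)).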

Answer 4:

The following gist flattens nested struct columns (it recurses into StructType only, so arrays and maps are left as they are):

import typing as T

import cytoolz.curried as tz
import pyspark


def schema_to_columns(schema: pyspark.sql.types.StructType) -> T.List[T.List[str]]:
    """
    Produce a flat list of column specs from a possibly nested DataFrame schema
    """

    columns = list()

    def helper(schm: pyspark.sql.types.StructType, prefix: list = None):

        if prefix is None:
            prefix = list()

        for item in schm.fields:
            if isinstance(item.dataType, pyspark.sql.types.StructType):
                helper(item.dataType, prefix + [item.name])
            else:
                columns.append(prefix + [item.name])

    helper(schema)

    return columns

def flatten_frame(frame: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    """
    Select every leaf column, aliasing nested ones as 'parent:child'
    """

    aliased_columns = list()

    for col_spec in schema_to_columns(frame.schema):
        c = tz.get_in(col_spec, frame)
        if len(col_spec) == 1:
            aliased_columns.append(c)
        else:
            aliased_columns.append(c.alias(':'.join(col_spec)))

    return frame.select(aliased_columns)

You can then flatten the nested data as follows:

flatten_data = flatten_frame(nested_df)

This will give you the flattened DataFrame.

The gist was taken from https://gist.github.com/DGrady/b7e7ff3a80d7ee16b168eb84603f5599

Source: https://stackoverflow.com/questions/34271398/flatten-nested-spark-dataframe

Tags

apache-spark

pyspark

spark-dataframe