Rename nested field in Spark dataframe

无人及你 2020-11-27 16:33

Having a dataframe df in Spark with the following schema:

 |-- array_field: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)

How can I rename a nested field such as array_field.a?

3 Answers
  • 予麋鹿 (OP)
    2020-11-27 16:56

    You can recurse over the data frame's schema to create a new schema with the required changes.

    A schema in PySpark is a StructType, which holds a list of StructFields; each StructField in turn holds either a primitive type or another complex type such as a StructType or ArrayType.

    This means we can decide whether to recurse based on whether a field's type is a StructType (or an ArrayType, whose element type may itself be a struct).
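
    For instance, the distinction that drives the recursion is easy to see by inspecting a schema directly (a minimal sketch with a made-up schema):

    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = StructType([
        StructField("array_field", ArrayType(
            StructType([StructField("a", StringType())])
        ))
    ])

    field = schema.fields[0]
    print(isinstance(field.dataType, ArrayType))               # True: recurse into elementType
    print(isinstance(field.dataType.elementType, StructType))  # True: recurse into its fields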

    Below is an annotated sample implementation of this idea.

    # Imports: note that `from pyspark.sql import *` would not bring in the type
    # classes, so we import what we need explicitly.
    from pyspark.sql import DataFrame
    from pyspark.sql.types import ArrayType, DataType, StructField, StructType
    from copy import copy
    
    # We take a dataframe and return a new one with required changes
    def cleanDataFrame(df: DataFrame) -> DataFrame:
        # Returns a new sanitized field name (this function can be anything really)
        def sanitizeFieldName(s: str) -> str:
            return s.replace("-", "_").replace("&", "_").replace("\"", "_")\
                .replace("[", "_").replace("]", "_").replace(".", "_")
    
        # We call this on all fields to create a copy and to perform any changes we might
        # want to do to the field.
        def sanitizeField(field: StructField) -> StructField:
            field = copy(field)
            field.name = sanitizeFieldName(field.name)
            # We recursively call cleanSchema on all types
            field.dataType = cleanSchema(field.dataType)
            return field
    
        def cleanSchema(dataType: DataType) -> DataType:
            dataType = copy(dataType)
            # If the type is a StructType we need to recurse otherwise we can return since
            # we've reached the leaf node
            if isinstance(dataType, StructType):
                # We call our sanitizer for all top level fields
                dataType.fields = [sanitizeField(f) for f in dataType.fields]
            elif isinstance(dataType, ArrayType):
                dataType.elementType = cleanSchema(dataType.elementType)
            return dataType
    
        # With the cleaned schema in hand, create a new DataFrame from the old
        # frame's RDD and the new schema (this assumes an active SparkSession
        # bound to the name `spark`).
        return spark.createDataFrame(df.rdd, cleanSchema(df.schema))
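
    To use it, call the function on a dataframe whose nested field names need fixing. A quick sketch (the dataframe below is hypothetical, built just to demonstrate):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # A struct field whose name contains a dot, nested inside an array.
    dirty_schema = StructType([
        StructField("array_field", ArrayType(
            StructType([StructField("a.one", StringType())])
        ))
    ])
    df = spark.createDataFrame([([("x",)],)], dirty_schema)

    cleanDataFrame(df).printSchema()
    # root
    #  |-- array_field: array (nullable = true)
    #  |    |-- element: struct (containsNull = true)
    #  |    |    |-- a_one: string (nullable = true)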
    
