How to modify a Spark Dataframe with a complex nested structure?

冷暖自知 提交于 2019-12-04 12:22:28

问题


I've a complex DataFrame structure and would like to null a column easily. I've created implicit classes that wire functionality and easily address 2D DataFrame structures but once the DataFrame becomes more complicated with ArrayType or MapType I've not had much luck. For example:

I have schema defined as:

StructType(
    StructField(name,StringType,true), 
    StructField(data,ArrayType(
        StructType(
            StructField(name,StringType,true), 
            StructField(values,
                MapType(StringType,StringType,true),
            true)
        ),
        true
    ),
    true)
)

I'd like to produce a new DF that has the field data.value of MapType set to null, but as this is an element of an array I have not been able to figure out how. I would think it would be similar to:

df.withColumn("data.values", functions.array(functions.lit(null)))

but this ultimately creates a new column of data.values and does not modify the values element of the data array.


回答1:


Since Spark 1.6, you can use case classes to map your dataframes (called datasets). Then, you can map your data and transform it to the new schema you want. For example:

case class Root(name: String, data: Seq[Data])
case class Data(name: String, values: Map[String, String])
case class NullableRoot(name: String, data: Seq[NullableData])
case class NullableData(name: String, value: Map[String, String], values: Map[String, String])

val nullableDF = df.as[Root].map { root =>
  val nullableData = root.data.map(data => NullableData(data.name, null, data.values))
  NullableRoot(root.name, nullableData)
}.toDF()

The resulting schema of nullableDF will be:

root
 |-- name: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- values: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)


来源:https://stackoverflow.com/questions/36733640/how-to-modify-a-spark-dataframe-with-a-complex-nested-structure

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!