问题
I have a pyspark dataframe which contains a column "student" as follows:
"student" : {
"name" : "kaleem",
"rollno" : "12"
}
Schema for this in dataframe is :
structType(List(
name: String,
rollno: String))
I need to modify this column as
"student" : {
"student_details" : {
"name" : "kaleem",
"rollno" : "12"
}
}
Schema for this in dataframe must be :
structType(List(
student_details:
structType(List(
name: String,
rollno: String))
))
How to do this in spark?
回答1:
Use named_struct function to achieve this-
1. Read the json as column
val data =
"""
| {
| "student": {
| "name": "kaleem",
| "rollno": "12"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(data).toDS())
df.show(false)
println(df.schema("student"))
Output-
+------------+
|student |
+------------+
|[kaleem, 12]|
+------------+
StructField(student,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)
2. change the schema using named_struct
val processedDf = df.withColumn("student",
expr("named_struct('student_details', student)")
)
processedDf.show(false)
println(processedDf.schema("student"))
Output-
+--------------+
|student |
+--------------+
|[[kaleem, 12]]|
+--------------+
StructField(student,StructType(StructField(student_details,StructType(StructField(name,StringType,true), StructField(rollno,StringType,true)),true)),false)
For python
step#2
will work as is just remove val
回答2:
With a library called spark-hats - This library extends Spark DataFrame API with helpers for transforming fields inside nested structures and arrays of arbitrary levels of nesting., you can do a lot of these transformations.
scala> import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- ID: string (nullable = true)
scala> val df2 = df.nestedMapColumn("ID", "ID", c => struct(c as alfa))
scala> df2.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: string (nullable = true)
scala> val df3 = df2.nestedMapColumn("ID.alfa", "ID.alfa", c => struct(c as "beta"))
scala> df3.printSchema
root
|-- ID: struct (nullable = false)
| |-- alfa: struct (nullable = false)
| | |-- beta: string (nullable = true)
Your query would be
df.nestedMapColumn("student", "student", c => struct(c as "student_details"))
来源:https://stackoverflow.com/questions/62050145/modify-a-struct-column-in-spark-dataframe