How to handle missing nested fields in spark?

Posted by 拥有回忆 on 2020-06-17 09:36:49

Question


Given the two case classes:

case class Response(
  responseField: String,
  ...
  items: List[Item])

case class Item(
  itemField: String,
  ...)

I am creating a Response dataset:

val dataset = spark.read.format("parquet")
                .load(inputPath)
                .as[Response]
                .map(x => x)

The issue arises when itemField is not present in any of the rows: Spark raises org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField were a top-level column I could handle it with dataset.withColumn("itemField", lit("")). Is it possible to do the same for a field nested inside the List?


Answer 1:


I assume the following:

Data was written with the following schema:

case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")

Now schema changed to:

case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails

For Spark 2.4, see: Spark - How to add an element to an array of structs
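For context, the Spark 2.4 approach in that link rebuilds each struct in the array with the SQL higher-order function `transform`, e.g. `transform(items, x -> named_struct('itemField', x.itemField, 'newField', 1))`, avoiding a UDF entirely. As a sketch (the `PatchedItem` name and the default value 1 are illustrative, not from the original), the per-element rewrite it performs looks like this in plain Scala:

```scala
// Models what transform(items, x -> named_struct(...)) does to each
// array element: copy the existing field, add a constant default.
case class Item(itemField: String)
case class PatchedItem(itemField: String, newField: Int)

def patch(items: List[Item], default: Int): List[PatchedItem] =
  items.map(x => PatchedItem(x.itemField, default))

val patched = patch(List(Item("a"), Item("b")), 1)
// patched == List(PatchedItem("a", 1), PatchedItem("b", 1))
```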

For Spark 2.3 this should work:

val addNewField: (Array[String], Array[Int]) => Array[Item] =
  (itemFields, newFields) =>
    itemFields.zip(newFields).map { case (i, n) => Item(i, n) }

val addNewFieldUdf = udf(addNewField)
spark.read.parquet("/tmp/structTest")
   .withColumn("items", addNewFieldUdf(
      col("items.itemField") as "itemField", 
      array(lit(1)) as "newField"
   )).as[Response].map(x => x) // Works
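One caveat worth noting (my observation, not part of the original answer): `zip` truncates to the shorter of the two arrays, so pairing a non-empty `items` column with the single-element `array(lit(1))` would keep only the first item. The data written above contains only empty lists, so it does not bite here. A small sketch of the UDF body in plain Scala:

```scala
case class Item(itemField: String, newField: Int)

// The UDF body from the answer, as a plain Scala function:
val addNewField: (Array[String], Array[Int]) => Array[Item] =
  (itemFields, newFields) =>
    itemFields.zip(newFields).map { case (i, n) => Item(i, n) }

// zip stops at the shorter array: two item fields paired with a
// one-element newFields array yields a single Item.
val out = addNewField(Array("a", "b"), Array(1)).toList
// out == List(Item("a", 1))
```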


Source: https://stackoverflow.com/questions/61919972/how-to-handle-missing-nested-fields-in-spark
