Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)

守給你的承諾、 submitted on 2019-11-29 07:13:17

In simple cases you can provide an initial schema that is a superset of the expected schemas. For example, in your case:

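// Assumes the MyType case class from the question, presumably
// case class MyType(column1: Option[String], column2: Option[Seq[String]]),
// and import spark.implicits._ in scope (automatic in spark-shell).
val schema = Seq[MyType]().toDF.schema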

Seq("a", "b", "c").map(Option(_))
  .toDF("column1")
  .write.parquet("/tmp/column1only")

val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
df.show
+-------+-------+
|column1|column2|
+-------+-------+
|      a|   null|
|      b|   null|
|      c|   null|
+-------+-------+
df.first
MyType = MyType(Some(a),None)

This approach can be somewhat fragile, so in general it is better to fill in the missing columns with typed SQL null literals:

import org.apache.spark.sql.functions.lit

spark.read.parquet("/tmp/column1only")
  // the cast target can also be given as a DataType: ArrayType(StringType)
  .withColumn("column2", lit(null).cast("array<string>"))
  .as[MyType]
  .first
MyType = MyType(Some(a),None)
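
If more than one column may be missing, the same literal-based idea can be generalized. The following is only a minimal sketch, not part of the original answer: the helper name withMissingColumns is hypothetical, the definition of MyType is an assumption inferred from the title and the output above, and the data is written to a temporary directory instead of /tmp/column1only so that the example is self-contained. It takes the target schema from the case class encoder and adds every field that is absent from the DataFrame as a null literal cast to the expected type.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructType

// Assumed shape of MyType; the original post does not show its definition.
case class MyType(column1: Option[String], column2: Option[Seq[String]])

object FillMissingColumns {
  // Hypothetical helper: for each field of the target schema that df lacks,
  // add a column of nulls cast to that field's data type.
  def withMissingColumns(df: DataFrame, target: StructType): DataFrame =
    target.fields.foldLeft(df) { (acc, field) =>
      if (acc.columns.contains(field.name)) acc
      else acc.withColumn(field.name, lit(null).cast(field.dataType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-columns")
      .getOrCreate()
    import spark.implicits._

    // Write a file that contains only column1, as in the example above.
    // The "data" subdirectory does not exist yet, so the default save mode succeeds.
    val path = Files.createTempDirectory("column1only").resolve("data").toString
    Seq("a", "b", "c").map(Option(_)).toDF("column1").write.parquet(path)

    // Derive the expected schema from the case class and fill in the gaps.
    val target = Encoders.product[MyType].schema
    val ds = withMissingColumns(spark.read.parquet(path), target).as[MyType]

    ds.show()
    spark.stop()
  }
}
```

The column check here is case-sensitive and only looks at top-level fields, so nested structs with missing members would need additional handling.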