Spark 2.0 implicit encoder: dealing with a missing column when the type is Option[Seq[String]] (Scala)

花落未央 2020-12-18 00:37

I'm having some trouble encoding data when some columns of type Option[Seq[String]] are missing from our data source. Ideally I would like the missing column data to be interpreted as None.

1 Answer
  • 2020-12-18 01:23
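    For reference, the snippets below assume a setup roughly like the following; the MyType definition is not given in the question, so it is inferred from the Option[Seq[String]] column mentioned there and from the output shown further down:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    // In spark-shell, `spark` and its implicits are already available.
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Assumed case class: column2 is the Option[Seq[String]] column that may be
    // missing from the data source.
    case class MyType(column1: Option[String], column2: Option[Seq[String]])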

    In simple cases you can provide an initial schema which is a superset of the expected schemas. For example, in your case:

    // Derive the full (superset) schema from the case class.
    val schema = Seq[MyType]().toDF.schema

    // Write sample data that contains only column1.
    Seq("a", "b", "c").map(Option(_))
      .toDF("column1")
      .write.parquet("/tmp/column1only")

    // Read it back with the superset schema; the missing column2 comes back as null.
    val df = spark.read.schema(schema).parquet("/tmp/column1only").as[MyType]
    df.show
    
    +-------+-------+
    |column1|column2|
    +-------+-------+
    |      a|   null|
    |      b|   null|
    |      c|   null|
    +-------+-------+
    
    df.first
    
    MyType = MyType(Some(a),None)
    

    This approach can be a little fragile, so in general it is better to use SQL literals to fill in the blanks:

    spark.read.parquet("/tmp/column1only")
      // Add the missing column as a null literal cast to the right type
      // (equivalently, ArrayType(StringType)).
      .withColumn("column2", lit(null).cast("array<string>"))
      .as[MyType]
      .first
    
    MyType = MyType(Some(a),None)
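
    The inline comment in the last snippet points at the equivalent cast through the DataType API; spelled out under the same assumed MyType, that variant looks like this:

    import org.apache.spark.sql.types.{ArrayType, StringType}

    spark.read.parquet("/tmp/column1only")
      .withColumn("column2", lit(null).cast(ArrayType(StringType)))
      .as[MyType]
      .first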
    