Dataframe to Dataset which has type Any

Backend · asked by 别那么骄傲 on 2020-12-19 15:53

I recently moved from Spark 1.6 to Spark 2.X, and I would like to move, where possible, from DataFrames to Datasets as well. I tried code like this:
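
A minimal sketch of the kind of code that hits this problem, assuming a spark-shell style session (spark and sc in scope) and a case class with an Any field like the FooBar used in the answer below:

    import spark.implicits._

    // A field typed as Any has no usable implicit Encoder, so deriving the
    // product encoder for FooBar fails: Spark cannot build a schema for Any.
    case class FooBar(foo: Int, bar: Any)

    sc.parallelize(Seq(FooBar(1, "a"))).toDS
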
1 Answer
  • Answered 2020-12-19 16:50

    Unless you're interested in limited and ugly workarounds like Encoders.kryo:

    import org.apache.spark.sql.Encoders

    case class FooBar(foo: Int, bar: Any)

    // Kryo serializes each FooBar into a single binary column,
    // so the result has no separate foo / bar columns to query.
    spark.createDataset(
      sc.parallelize(Seq(FooBar(1, "a")))
    )(Encoders.kryo[FooBar])
    

    or

    // Keeps foo as a regular Int column (_1) and Kryo-encodes only bar (_2, binary).
    spark.createDataset(
      sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
    )(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
    

    you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
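
    For comparison, a minimal sketch of the case that does work, assuming a spark-shell style session and a concrete field type (FooBarTyped is only an illustrative name), so the standard product encoder from spark.implicits._ can be derived:

    import spark.implicits._

    // With a concrete type for bar, Spark derives the product encoder
    // and the Dataset gets real foo and bar columns.
    case class FooBarTyped(foo: Int, bar: String)

    val ds = sc.parallelize(Seq(FooBarTyped(1, "a"))).toDS
    ds.printSchema()
    // root
    //  |-- foo: integer (nullable = false)
    //  |-- bar: string (nullable = true)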

    The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.

    If, for a given execution, all values are of the same type, you can of course create specialized classes and decide which one to use at run time.
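
    A minimal sketch of that idea, assuming the per-run payload is either an Int or a String; the class and helper names here are only illustrative:

    import org.apache.spark.sql.Dataset
    import spark.implicits._

    // One specialized case class per concrete payload type.
    case class FooBarString(foo: Int, bar: String)
    case class FooBarInt(foo: Int, bar: Int)

    // Decide once per run which representation applies; each branch
    // builds a Dataset with its own statically known encoder.
    def load(raw: Seq[(Int, Any)]): Either[Dataset[FooBarString], Dataset[FooBarInt]] =
      raw.headOption.map(_._2) match {
        case Some(_: Int) =>
          Right(spark.createDataset(raw.map { case (f, b) => FooBarInt(f, b.asInstanceOf[Int]) }))
        case _ =>
          Left(spark.createDataset(raw.map { case (f, b) => FooBarString(f, b.toString) }))
      }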
