I recently moved from Spark 1.6 to Spark 2.x and I would like to move, where possible, from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a: Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, a String, an Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:

java.lang.ClassNotFoundException: scala.Any

With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo:
import org.apache.spark.sql.Encoders
case class FooBar(foo: Int, bar: Any)
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])
or
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
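In both cases the Kryo-encoded part is stored as a single opaque binary column, so you cannot select, filter on, or otherwise reference its fields through the DataFrame/Dataset column API; that is the sense in which these workarounds are limited.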
you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
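For contrast, here is a minimal sketch of the happy path, assuming a spark-shell style session where spark is in scope (MyLongClass and the column name "a" are illustrative): once every field has a concrete type, the product encoder derived for the case class applies automatically and no explicit Encoder is needed.

import spark.implicits._

// Concrete field type instead of Any
case class MyLongClass(a: Long)

val df = Seq(1L, 2L, 3L).toDF("a")
// getLong(0) fixes the type at compile time, so the implicitly
// derived Encoder[MyLongClass] is picked up -- no Kryo needed
val ds = df.map(x => MyLongClass(x.getLong(0)))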
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.
If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use, as sketched below.
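A minimal sketch of that runtime dispatch, reusing the question's df and assuming the column's type can be read from its schema (MyLongClass and MyStringClass are hypothetical):

import org.apache.spark.sql.types.{LongType, StringType}
import spark.implicits._

case class MyLongClass(a: Long)
case class MyStringClass(a: String)

// Inspect the schema once, then branch to a Dataset with a
// concrete, encodable element type for the rest of the job.
df.schema.fields.head.dataType match {
  case LongType =>
    val ds = df.map(x => MyLongClass(x.getLong(0)))
    // ... continue with the strongly typed Dataset[MyLongClass]
  case StringType =>
    val ds = df.map(x => MyStringClass(x.getString(0)))
    // ... continue with Dataset[MyStringClass]
  case other =>
    sys.error(s"Unsupported column type: $other")
}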
Source: https://stackoverflow.com/questions/41504976/dataframe-to-dataset-which-has-type-any