I recently moved from Spark 1.6 to Spark 2.x and I would like to move, where possible, from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a: Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, a String, an Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:

java.lang.ClassNotFoundException: scala.Any

With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo:
import org.apache.spark.sql.Encoders
case class FooBar(foo: Int, bar: Any)
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])
or
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
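In both cases the Kryo-encoded part is stored as a single opaque binary column, so you cannot select, filter on, or otherwise reference its fields through the DataFrame/Dataset column API; that is the sense in which these workarounds are limited.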
you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
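For contrast, here is a minimal sketch of the happy path, assuming a spark-shell style session where spark is in scope (MyLongClass and the column name "a" are illustrative): once every field has a concrete type, the product encoder derived for the case class applies automatically and no explicit Encoder is needed.

import spark.implicits._

// Concrete field type instead of Any
case class MyLongClass(a: Long)

val df = Seq(1L, 2L, 3L).toDF("a")
// getLong(0) fixes the type at compile time, so the implicitly
// derived Encoder[MyLongClass] is picked up -- no Kryo needed
val ds = df.map(x => MyLongClass(x.getLong(0)))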
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.
If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use, as sketched below.
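A minimal sketch of that runtime dispatch, reusing the question's df and assuming the column's type can be read from its schema (MyLongClass and MyStringClass are hypothetical):

import org.apache.spark.sql.types.{LongType, StringType}
import spark.implicits._

case class MyLongClass(a: Long)
case class MyStringClass(a: String)

// Inspect the schema once, then branch to a Dataset with a
// concrete, encodable element type for the rest of the job.
df.schema.fields.head.dataType match {
  case LongType =>
    val ds = df.map(x => MyLongClass(x.getLong(0)))
    // ... continue with the strongly typed Dataset[MyLongClass]
  case StringType =>
    val ds = df.map(x => MyStringClass(x.getString(0)))
    // ... continue with Dataset[MyStringClass]
  case other =>
    sys.error(s"Unsupported column type: $other")
}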
Source: https://stackoverflow.com/questions/41504976/dataframe-to-dataset-which-has-type-any