I have a trait called MyTrait that takes a type parameter, and one of its methods needs to create an empty typed Dataset.
trait MyTrait[T] {
  val sparkSession: SparkSession
  val spark = sparkSession.session
  val sparkContext = spark.sparkContext

  def createEmptyDataset(): Dataset[T] = {
    import spark.implicits._ // to access .toDS() function
    // DOESN'T WORK.
    val emptyRDD = sparkContext.parallelize(Seq[T]())
    val accumulator = emptyRDD.toDS()
    ...
  }
}
So far I have not gotten it to work. The compiler complains that there is no ClassTag for T, and that value toDS is not a member of org.apache.spark.rdd.RDD[T].
Any help would be appreciated. Thanks!
You have to provide both a ClassTag[T] and an Encoder[T] in the same scope. For example:
import org.apache.spark.sql.{SparkSession, Dataset, Encoder}
import scala.reflect.ClassTag

trait MyTrait[T] {
  val ct: ClassTag[T]
  val enc: Encoder[T]
  val sparkSession: SparkSession
  // lazy, so it is not evaluated before the implementing class initializes sparkSession
  lazy val sparkContext = sparkSession.sparkContext

  def createEmptyDataset(): Dataset[T] = {
    // emptyRDD needs the ClassTag, createDataset needs the Encoder
    val emptyRDD = sparkContext.emptyRDD[T](ct)
    sparkSession.createDataset(emptyRDD)(enc)
  }
}
with a concrete implementation:
class Foo extends MyTrait[Int] {
  val sparkSession = SparkSession.builder.getOrCreate()
  import sparkSession.implicits._

  val ct = implicitly[ClassTag[Int]]
  val enc = implicitly[Encoder[Int]]
}
It is possible to skip the RDD:
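import org.apache.spark.sql.{SparkSession, Dataset, Encoder}

trait MyTrait[T] {
  val enc: Encoder[T]
  val sparkSession: SparkSession
  // lazy, so it is not evaluated before the implementing class initializes sparkSession
  lazy val sparkContext = sparkSession.sparkContext

  def createEmptyDataset(): Dataset[T] = {
    // only the Encoder is required here; no ClassTag and no RDD
    sparkSession.emptyDataset[T](enc)
  }
}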
See also How to declare traits as taking implicit "constructor parameters"?, specifically the answer by Blaisorblade and the one by Alexey Romanov.
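One common way to get implicit "constructor parameters" is to use an abstract class with a context bound instead of a trait. The sketch below is only an illustration of that idea, not code from the linked answers: MyBase, EmptyDatasetExample, the local[1] master and the app name are all hypothetical names and settings chosen for the example.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Hypothetical base class (not from the original answer). The context bound
// `T: Encoder` desugars to an implicit constructor parameter, which a
// Scala 2 trait cannot declare.
abstract class MyBase[T: Encoder] {
  // A def avoids initialization-order problems with vals in subclasses.
  def sparkSession: SparkSession

  // The Encoder[T] captured at construction time is picked up implicitly.
  def createEmptyDataset(): Dataset[T] = sparkSession.emptyDataset[T]
}

object EmptyDatasetExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and appName are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-dataset-sketch")
      .getOrCreate()
    import spark.implicits._ // supplies Encoder[Int] at the instantiation site

    val ints = new MyBase[Int] { def sparkSession: SparkSession = spark }
    println(ints.createEmptyDataset().count())

    spark.stop()
  }
}
```

With this shape the subclass does not have to define ct or enc by hand; the encoder is resolved wherever the class is instantiated. The trade-off is that MyBase is a class, so it occupies the single superclass slot.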
Source: https://stackoverflow.com/questions/47644051/spark-scala-create-empty-dataset-using-generics-in-a-trait