According to Introducing Spark Datasets:
"As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects."
My examples will be in Java, but I don't imagine it would be difficult to adapt them to Scala.
I have been quite successful converting an RDD&lt;Fruit&gt; to a Dataset&lt;Fruit&gt; using spark.createDataset and Encoders.bean, as long as Fruit is a simple Java Bean.
Step 1: Create the simple Java Bean.
import java.io.Serializable;

public class Fruit implements Serializable {
    private String name = "default-fruit";
    private String color = "default-color";

    // AllArgsConstructor
    public Fruit(String name, String color) {
        this.name = name;
        this.color = color;
    }

    // NoArgsConstructor -- required by Encoders.bean
    public Fruit() {
        this("default-fruit", "default-color");
    }

    // Getters and setters for every field -- also required by Encoders.bean
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getColor() { return color; }
    public void setColor(String color) { this.color = color; }
}
I'd stick to classes with only primitive types and String as fields until the Databricks folks beef up their Encoders. If you have a class with a nested object, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one (a sketch of that follows). Sure, it's a little extra work, but I imagine a flat schema will help a lot with performance.
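To make the flattening concrete, here's a minimal sketch. Basket, FlatBasket, and basketRDD are hypothetical names invented for this example; the pattern is the same map-then-encode flow as Step 2 below.

// Hypothetical nested type that Encoders.bean may choke on.
public class Basket implements Serializable {
    private String owner = "";
    private Fruit fruit = new Fruit();
    public Basket() {}
    public Basket(String owner, Fruit fruit) { this.owner = owner; this.fruit = fruit; }
    public String getOwner() { return owner; }
    public void setOwner(String owner) { this.owner = owner; }
    public Fruit getFruit() { return fruit; }
    public void setFruit(Fruit fruit) { this.fruit = fruit; }
}

// Flattened counterpart: only String fields, safe for Encoders.bean.
public class FlatBasket implements Serializable {
    private String owner = "";
    private String fruitName = "";
    private String fruitColor = "";
    public FlatBasket() {}
    public FlatBasket(String owner, String fruitName, String fruitColor) {
        this.owner = owner; this.fruitName = fruitName; this.fruitColor = fruitColor;
    }
    public String getOwner() { return owner; }
    public void setOwner(String owner) { this.owner = owner; }
    public String getFruitName() { return fruitName; }
    public void setFruitName(String fruitName) { this.fruitName = fruitName; }
    public String getFruitColor() { return fruitColor; }
    public void setFruitColor(String fruitColor) { this.fruitColor = fruitColor; }
}

// Flatten with a plain RDD map, then encode as usual (basketRDD is assumed
// to be an existing JavaRDD<Basket>).
JavaRDD<FlatBasket> flatRDD = basketRDD.map(b -> new FlatBasket(
    b.getOwner(), b.getFruit().getName(), b.getFruit().getColor()));
Dataset<FlatBasket> basketDataset =
    spark.createDataset(flatRDD.rdd(), Encoders.bean(FlatBasket.class));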
Step 2: Get your Dataset from the RDD
import com.google.common.collect.ImmutableList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.*;

SparkSession spark = SparkSession.builder().getOrCreate();
// Reuse the session's underlying context rather than constructing a second one.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Fruit> fruitList = ImmutableList.of(
    new Fruit("apple", "red"),
    new Fruit("orange", "orange"),
    new Fruit("grape", "purple"));
JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
And voila! Lather, rinse, repeat.
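As a quick sanity check (assuming the fruitDataset built above), you can confirm the encoder inferred a flat schema and that typed operations work:

import org.apache.spark.api.java.function.FilterFunction;

// Shows name and color as top-level string columns, then the three rows.
fruitDataset.printSchema();
fruitDataset.show();

// Typed filter over the bean; the cast picks the FilterFunction overload.
Dataset<Fruit> redFruit = fruitDataset.filter(
    (FilterFunction<Fruit>) f -> "red".equals(f.getColor()));
redFruit.show();   // just the apple row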