How to store custom objects in Dataset?

别那么骄傲 2020-11-22 01:53

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom

9 answers
  •  我寻月下人不归
    2020-11-22 02:11

    My examples are in Java, but adapting them to Scala should not be difficult.

    I have been quite successful converting an RDD<Fruit> to a Dataset<Fruit> using spark.createDataset and Encoders.bean, as long as Fruit is a simple Java Bean.

    Step 1: Create the simple Java Bean.

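    import java.io.Serializable;

    public class Fruit implements Serializable {
        private String name  = "default-fruit";
        private String color = "default-color";

        // All-args constructor
        public Fruit(String name, String color) {
            this.name  = name;
            this.color = color;
        }

        // No-args constructor (the bean encoder needs one to instantiate objects)
        public Fruit() {
            this("default-fruit", "default-color");
        }

        // Getters and setters: Encoders.bean derives the schema from these
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getColor() { return color; }
        public void setColor(String color) { this.color = color; }
    }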
    

    I'd stick to classes whose fields are primitive types and Strings until the Databricks folks beef up their Encoders. If you have a class with a nested object, create another simple Java Bean with all of its fields flattened, then use RDD transformations to map the complex type to the simpler one. It is a little extra work, but I imagine working with a flat schema helps performance a lot.
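To make the flattening idea concrete, here is a minimal sketch in plain Java. The Order, Customer and FlatOrder classes and the FlatOrder.from helper are hypothetical names invented for illustration; they are not part of Spark or of the original example.

```java
import java.io.Serializable;

// Hypothetical nested type: an Order that holds a Customer object.
class Customer implements Serializable {
    final String name;
    final String city;
    Customer(String name, String city) { this.name = name; this.city = city; }
}

class Order implements Serializable {
    final long id;
    final Customer customer;
    Order(long id, Customer customer) { this.id = id; this.customer = customer; }
}

// Flat bean: only a primitive and Strings, a no-args constructor,
// and a getter/setter pair per field.
// Assumption: in a real project this would be a public class in its own file.
class FlatOrder implements Serializable {
    private long id;
    private String customerName;
    private String customerCity;

    public FlatOrder() {}

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public String getCustomerName() { return customerName; }
    public void setCustomerName(String customerName) { this.customerName = customerName; }
    public String getCustomerCity() { return customerCity; }
    public void setCustomerCity(String customerCity) { this.customerCity = customerCity; }

    // Copies the nested Customer fields up into top-level fields.
    static FlatOrder from(Order order) {
        FlatOrder flat = new FlatOrder();
        flat.setId(order.id);
        flat.setCustomerName(order.customer.name);
        flat.setCustomerCity(order.customer.city);
        return flat;
    }
}

class FlattenDemo {
    public static void main(String[] args) {
        FlatOrder flat = FlatOrder.from(new Order(7L, new Customer("Ada", "London")));
        System.out.println(flat.getId() + " " + flat.getCustomerName() + " " + flat.getCustomerCity());
    }
}
```

Given a JavaRDD<Order>, something like orders.map(FlatOrder::from) should then yield a JavaRDD<FlatOrder> that can go through the same createDataset call as Fruit in Step 2, with Encoders.bean(FlatOrder.class) as the encoder.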

    Step 2: Get your Dataset from the RDD

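    import java.util.List;
    import com.google.common.collect.ImmutableList;  // Guava
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();
    // Reuse the session's SparkContext instead of creating a second one
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    List<Fruit> fruitList = ImmutableList.of(
        new Fruit("apple", "red"),
        new Fruit("orange", "orange"),
        new Fruit("grape", "purple"));
    JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);

    // createDataset takes a Scala RDD, so unwrap the JavaRDD
    RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
    Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
    Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);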
    

    And voila! Lather, rinse, repeat.
