Spark Java Map function is getting executed twice

Asked 2020-12-11 20:07 by 悲哀的现实

I have the code below as my Spark driver. When I execute the program it works properly, saving the required data as a Parquet file, but the map function that produces the JSON strings is getting executed twice for every record.

String indexFile = \"index.txt\";
JavaRD         


        
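Roughly, the driver reads an index file, maps each line to a JSON string, and saves the result as Parquet. A minimal sketch of such a driver (the toJson helper and the output path are assumptions, not from the original snippet):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class Driver {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("json-to-parquet"));
            SQLContext sqlContext = new SQLContext(sc);

            String indexFile = "index.txt";
            // The map whose work is observed twice per record.
            JavaRDD<String> jsonStringRDD = sc.textFile(indexFile)
                    .map(line -> toJson(line));

            DataFrame df = sqlContext.read().json(jsonStringRDD);
            df.write().parquet("output.parquet"); // assumed output path
        }

        // Hypothetical helper: turn one input line into a JSON document.
        private static String toJson(String line) {
            return "{\"value\": \"" + line + "\"}";
        }
    }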
1 Answer
  • Answered 2020-12-11 20:43

    I believe the reason is the lack of a schema for the JSON reader. When you execute:

    sqlContext.read().json(jsonStringRDD);
    

    Spark has to infer a schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly, before any action is called. That inference scan is the first full pass over your data; the Parquet write is the second, which is why your map function runs twice.
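
    You can see the two passes with a counter in the map function (a sketch against the driver above; the accumulator name is illustrative):

    import org.apache.spark.Accumulator;

    final Accumulator<Integer> mapCalls = sc.accumulator(0, "mapCalls");

    JavaRDD<String> jsonStringRDD = sc.textFile(indexFile).map(line -> {
        mapCalls.add(1);          // count every invocation of the map body
        return toJson(line);
    });

    DataFrame df = sqlContext.read().json(jsonStringRDD);
    // Already equal to the record count: schema inference scanned the RDD.
    System.out.println(mapCalls.value());

    df.write().parquet("output.parquet");
    // Now twice the record count: the write was the second pass.
    System.out.println(mapCalls.value());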

    If you want to avoid this extra pass, you have to create a StructType which describes the shape of the JSON documents:

    StructType schema;
    ...
    
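    For example, if each document looked like {"id": 1, "name": "foo"} (an assumed shape, since the actual documents aren't shown), you could build the schema with the DataTypes factory methods:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical schema for documents like {"id": 1, "name": "foo"}.
    StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("id", DataTypes.LongType, true),
        DataTypes.createStructField("name", DataTypes.StringType, true)
    });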

    and use it when you create the DataFrame:

    DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
    
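    With an explicit schema there is no inference pass, so the map function runs only once, during the actual write (output path assumed as before):

    dataSchemaDF.write().parquet("output.parquet"); // single pass over the RDD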