Spark Java Map function is getting executed twice


Question


I have the below code as a Spark driver. When I execute my program, it works properly, saving the required data as a Parquet file.

String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
  @Override
  public String call(String patientId) throws Exception {
    return "json array as string";
  }
});

//1. Read JSON string array into a DataFrame (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
//2. Save DataFrame as Parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");

But I observed that my map function on the RDD indexData is executed twice: first, when I read jsonStringRDD as a DataFrame using SQLContext, and second, when I write dataSchemaDF to the Parquet file.

Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?


Answer 1:


I believe the reason is the lack of a schema for the JSON reader. When you execute:

sqlContext.read().json(jsonStringRDD);

Spark has to infer the schema for the newly created DataFrame. To do that, it has to scan the input RDD, and this step is performed eagerly.

If you want to avoid it, you have to create a StructType which describes the shape of the JSON documents:

StructType schema;
...
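
For instance, a minimal sketch of such a schema, assuming the JSON documents carry hypothetical fields patientId (a string) and measurements (an array of doubles), could look like this:

import java.util.Arrays;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical field names and types; replace them with the actual
// attributes of your JSON documents.
StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("patientId", DataTypes.StringType, false),
    DataTypes.createStructField("measurements",
        DataTypes.createArrayType(DataTypes.DoubleType), true)));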

and use it when you create the DataFrame:

DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
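
With an explicit schema, the reader no longer has to scan the input RDD to infer one, so the map on indexData should run only once, when the DataFrame is actually written out as Parquet.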


Source: https://stackoverflow.com/questions/40073299/spark-java-map-function-is-getting-executed-twice
