I have the above code as a Spark driver; when I execute my program, it works properly, saving the required data as a Parquet file.
I believe the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid that, you have to create a StructType that describes the shape of your JSON documents:
StructType schema;
...
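For instance, here is a minimal sketch, assuming the JSON documents look like {"id": 1, "name": "foo"} (the field names are hypothetical; match them to your actual data):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields for documents like {"id": 1, "name": "foo"};
// replace them with the actual shape of your JSON.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("id", DataTypes.LongType, false),
    DataTypes.createStructField("name", DataTypes.StringType, true)
});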
and use it when you create the DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
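With an explicit schema, Spark can skip the inference scan entirely, so reading the JSON no longer triggers a job before you call an action.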