While writing to HDFS path, getting error: java.io.IOException: Failed to rename


Question


I am using spark-sql 2.4.1, which uses hadoop 2.6.5. I need to save my data to HDFS first and move it to Cassandra later, so I am trying to save the data to HDFS as below:

String hdfsPath = "/user/order_items/";
cleanedDs.createOrReplaceTempView("source_tab");

givenItemList.parallelStream().forEach( item -> {
    String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                 + "from source_tab group by year";
    Dataset<Row> resultDs = sparkSession.sql(query);

    saveDsToHdfs(hdfsPath, resultDs);
});


public static void saveDsToHdfs(String parquetPath, Dataset<Row> df) {
    // Append the Dataset as parquet files under the given HDFS path.
    df.write()
      .format("parquet")
      .mode("append")
      .save(parquetPath);
    logger.info("Saved parquet file: " + parquetPath + " successfully");
}

When I run my job on the cluster, it fails with this error:

java.io.IOException: Failed to rename FileStatus{path=hdfs:/user/order_items/_temporary/0/_temporary/attempt_20180626192453_0003_m_000007_59/part-00007.parquet; isDirectory=false; length=952309; replication=1; blocksize=67108864; modification_time=1530041098000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} to hdfs:/user/order_items/part-00007.parquet
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:415)

How can I fix this issue?


Answer 1:


You can do all the selects in one single job: run each select, union the results into a single Dataset, and write it once. With a single write there is only one commit into the output directory, so the concurrent rename conflict cannot happen.

Dataset<Row> resultDs = givenItemList.parallelStream().map( item -> {
    String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                 + "from source_tab group by year";
    return sparkSession.sql(query);
}).reduce(Dataset::union).get();

saveDsToHdfs(hdfsPath, resultDs);
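
A small robustness note (my addition, not part of the original answer): Stream.reduce returns an Optional, so the get() call above throws NoSuchElementException when givenItemList is empty. A minimal guarded variant under the same assumptions:

// Guarded variant: only write when at least one per-item query produced a result.
java.util.Optional<Dataset<Row>> maybeResult = givenItemList.parallelStream()
        .map(item -> sparkSession.sql(
                "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                + "from source_tab group by year"))
        .reduce(Dataset::union);

maybeResult.ifPresent(ds -> saveDsToHdfs(hdfsPath, ds));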



Answer 2:


The error occurs because you are writing the dataframe to the same location for every item in your givenItemList collection. Normally, doing that would fail with an error like

OutputDirectory already exists

But since forEach executes all the items on parallel threads, the concurrent jobs collide while committing instead: each job renames its output files out of the shared _temporary directory under the target path, and one job's cleanup can remove files another job is about to rename, which is exactly the failed rename in the stack trace. You can give each thread a separate directory, like this:

givenItemList.parallelStream().forEach( item -> {
    String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                 + "from source_tab group by year";
    Dataset<Row> resultDs = sparkSession.sql(query);

    saveDsToHdfs(String.format("%s_%s", hdfsPath, item), resultDs);
});

Alternatively, you can create per-item subdirectories under hdfsPath, like this:

givenItemList.parallelStream().forEach( item -> {
    String query = "select '" + item + "' as itemCol, avg(" + item + ") as mean "
                 + "from source_tab group by year";
    Dataset<Row> resultDs = sparkSession.sql(query);

    saveDsToHdfs(String.format("%s/%s", hdfsPath, item), resultDs);
});
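
A related sketch (my addition, not from either answer): if you instead keep the single unioned Dataset from Answer 1, one write with partitionBy produces a subdirectory per item while committing only once, assuming the itemCol column tags each row with its item name as in the queries above:

// One job, one commit, one subdirectory per item value,
// e.g. /user/order_items/itemCol=price/part-....parquet
// (partitionBy drops itemCol from the data files themselves).
resultDs.write()
        .format("parquet")
        .mode("append")
        .partitionBy("itemCol")
        .save(hdfsPath);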



Source: https://stackoverflow.com/questions/62036791/while-writing-to-hdfs-path-getting-error-java-io-ioexception-failed-to-rename
