handle corrupted gzip files in hadoop / hive


Question



I have daily folders with a lot of tar.gz files on HDFS containing a large number of text files.
A number of those tar.gz files turned out to be corrupted and cause hive/mapreduce jobs to crash with an "unexpected end of stream" error when processing them.

I identified a few of those and tested them with tar -zxvf. The extraction does indeed exit with an error, but it still extracts a decent number of files before that happens.

Is there a way to stop hive/mapreduce jobs from simply crashing when a tar/gz file is corrupted? I've tested some error-skipping and failure-tolerance parameters, such as
mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.

In a small number of cases this helped get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.
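For reference, here is roughly how I've been setting those properties (a minimal sketch in Java against the old mapred.* keys listed above; the exact values are only illustrative):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: enable skip mode and loosen map-failure tolerance before submitting
// a job. The values below are illustrative, not recommendations.
public class TolerantJobConfig {
    public static Configuration tolerantConf() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.skip.mode.enabled", true);
        conf.setInt("mapred.skip.attempts.to.start.skipping", 1);
        conf.setLong("mapred.skip.map.max.skip.records", Long.MAX_VALUE);
        conf.setInt("mapred.map.max.attempts", 4);
        // Tolerate up to 10% failed map tasks before failing the whole job.
        conf.setInt("mapred.max.map.failures.percent", 10);
        return conf;
    }
}
```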

Unzipping every single file outside hadoop just to recompress them afterwards (to get clean gzip files) and then upload them to hdfs again would be a painful process (because of the extra steps and the large volume of data this would generate).

Is there a cleaner / more elegant solution that someone has found?

Thanks for any help.


Answer 1:


I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.

I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
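For the gist of it, here's a rough sketch of that kind of reader (this is not the exact code from the post; it assumes the new mapreduce API and simply wraps the stock LineRecordReader, with illustrative class and variable names):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Wraps the stock LineRecordReader; any IOException (e.g. "unexpected end of
// stream" from a truncated gzip) is logged and treated as end-of-split, so
// the task finishes instead of failing.
public class SkipBadGzipRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate = new LineRecordReader();
    private String fileName;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        fileName = ((FileSplit) split).getPath().toString();
        delegate.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws InterruptedException {
        try {
            return delegate.nextKeyValue();
        } catch (IOException e) {
            // In a real job you'd use a logger and/or a counter here.
            System.err.println("Skipping corrupt file " + fileName + ": " + e);
            return false; // pretend the split is finished
        }
    }

    @Override
    public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

    @Override
    public Text getCurrentValue() { return delegate.getCurrentValue(); }

    @Override
    public float getProgress() throws IOException { return delegate.getProgress(); }

    @Override
    public void close() throws IOException { delegate.close(); }
}
```

To actually use it you'd also subclass TextInputFormat so that createRecordReader returns this reader, and point your job (or the Hive table's INPUTFORMAT) at that class.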




Answer 2:


I see essentially two ways out:

  1. You create a patch for Hadoop that allows this kind of handling of corrupted files and then simply run the applications against the corrupted files.
  2. You create a special hadoop application that uses your own custom 'gunzip' implementation (one that can handle these kinds of problems). This application then simply reads and writes the files as a map-only job (identity mapper), and its output is used as input for your normal mapreduce/pig/hive/... jobs. A rough sketch of such a tolerant gunzip pass follows below.
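A quick local sketch of what I mean by a tolerant 'gunzip' pass (plain Java; the file names, buffer size, and the choice to simply keep whatever decompresses before the stream breaks are all illustrative, not tested code):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Decompresses as much of a (possibly truncated) gzip file as possible,
// keeping the bytes recovered before the stream breaks.
public class TolerantGunzip {
    public static void main(String[] args) throws IOException {
        String in = args[0];   // e.g. some-archive.gz (possibly corrupt)
        String out = args[1];  // recovered output
        byte[] buf = new byte[64 * 1024];
        // Note: a gzip file with a completely unreadable header will still
        // throw from the GZIPInputStream constructor itself.
        try (GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(in));
             FileOutputStream dst = new FileOutputStream(out)) {
            try {
                int n;
                while ((n = gzip.read(buf)) != -1) {
                    dst.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Truncated/corrupt stream: keep what was already written.
                System.err.println("Stopped early on " + in + ": " + e);
            }
        }
    }
}
```

In the actual map-only job the same read loop would live inside the mapper, writing the recovered bytes back out (for tar.gz you'd still have to untar them) so that the downstream hive/pig jobs only ever see clean files.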


Source: https://stackoverflow.com/questions/19523724/handle-corrupted-gzip-files-in-hadoop-hive
