Hadoop options are not having any effect (mapreduce.input.lineinputformat.linespermap, mapred.max.map.failures.percent)

Submitted by 丶灬走出姿态 on 2019-12-30 10:43:22

Question


I am trying to implement a MapReduce job in which each mapper takes 150 lines of the text file and all the mappers run simultaneously. The job should also not fail, no matter how many map tasks fail.

Here's the configuration part:

        JobConf conf = new JobConf(Main.class);
        conf.setJobName("My mapreduce");

        conf.set("mapreduce.input.lineinputformat.linespermap", "150");
        conf.set("mapred.max.map.failures.percent","100");

        conf.setInputFormat(NLineInputFormat.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

The problem is that Hadoop creates a mapper for every single line of text, the mappers seem to run sequentially, and if a single one fails, the whole job fails.

From this I deduce that the settings I've applied have no effect.

What did I do wrong?


Answer 1:


I assume you are using Hadoop 0.20. In 0.20 the configuration parameter is "mapred.line.input.format.linespermap", but you are setting "mapreduce.input.lineinputformat.linespermap". If the parameter is not set, it defaults to 1, which is why you are seeing the behavior described in the question (one mapper per line).

Here is the code snippet from 0.20 NLineInputFormat.

        public void configure(JobConf conf) {
            N = conf.getInt("mapred.line.input.format.linespermap", 1);
        }

Hadoop configuration can be a real pain: it is not documented properly, and I have observed that configuration parameter names sometimes change between releases. When you are uncertain about a configuration parameter, the best bet is to read the source code of the release you are running.
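To illustrate why the misnamed key fails silently, here is a minimal sketch using plain `java.util.Properties` as a stand-in for the job configuration. `ConfLookupSketch` and its methods are hypothetical names for this example, not Hadoop classes. The only thing taken from Hadoop is the key and default in the `configure()` snippet quoted above.

```java
import java.util.Properties;

// Stand-in for a Hadoop-style configuration lookup; not the real Configuration class.
final class ConfLookupSketch {
    // Same shape as a getInt(key, default) lookup: a missing key yields the default.
    static int getInt(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Reads the key that the 0.20 configure() snippet quoted above reads.
    static int linesPerMap(Properties conf) {
        return getInt(conf, "mapred.line.input.format.linespermap", 1);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // The key from the question: stored, but never read by the lookup above.
        conf.setProperty("mapreduce.input.lineinputformat.linespermap", "150");
        System.out.println(linesPerMap(conf)); // prints 1 (the default)

        // The key the 0.20 code actually reads.
        conf.setProperty("mapred.line.input.format.linespermap", "150");
        System.out.println(linesPerMap(conf)); // prints 150
    }
}
```

No error or warning is raised for the unknown key, which matches the symptom in the question: the job simply runs with one line per mapper.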




Answer 2:


To start with, "mapred." is the old API and "mapreduce." is the new API, so you had better not mix them. Check which version you are using and stick with it. Also recheck your imports, since there are two NLineInputFormat classes as well (one in mapred and one in mapreduce).

Secondly, you can check this link (I'll paste the important part):

NLineInputFormat will split N lines of input as one split. So, each map gets N lines.

But the RecordReader is still LineRecordReader, which reads one line at a time, so the Key is the offset in the file and the Value is the line. If you want N lines as the Key, you may need to override LineRecordReader.
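The distinction between "N lines per split" and "one line per record" can be sketched in plain Java. This is an illustration only, assuming an in-memory list of lines. `NLineSplitSketch` and `split` are hypothetical names, and the real NLineInputFormat works on file offsets rather than lists.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: models splits as in-memory lists of lines.
final class NLineSplitSketch {
    // Groups consecutive lines into splits of at most n lines (one split per map task).
    static List<List<String>> split(List<String> lines, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 400; i++) {
            lines.add("line " + i);
        }

        List<List<String>> splits = split(lines, 150);
        System.out.println("map tasks: " + splits.size()); // prints "map tasks: 3"
        for (List<String> s : splits) {
            // Within one task, map() is still invoked once per line, not once per split.
            System.out.println("map() calls in this task: " + s.size());
        }
    }
}
```

With 400 input lines and N = 150, the sketch yields three tasks that receive 150, 150 and 100 records respectively.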




Answer 3:


If you want to quickly find the correct names for the options in Hadoop's new API, use this link: http://pydoop.sourceforge.net/docs/examples/intro.html#hadoop-0-21-0-notes
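As a small illustration of such a name lookup, the sketch below records the old and new names of the two options from the question. The pairs reflect the renaming as I understand it and should be checked against the source of the release in use. `OptionNameSketch` is a hypothetical helper, not part of Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table; verify the pairs against your Hadoop release.
final class OptionNameSketch {
    static final Map<String, String> OLD_TO_NEW = new LinkedHashMap<>();

    static {
        // Lines per split for NLineInputFormat.
        OLD_TO_NEW.put("mapred.line.input.format.linespermap",
                       "mapreduce.input.lineinputformat.linespermap");
        // Percentage of map tasks allowed to fail before the job fails.
        OLD_TO_NEW.put("mapred.max.map.failures.percent",
                       "mapreduce.map.failures.maxpercent");
    }

    // Returns the new-style name, or the input unchanged if no mapping is recorded.
    static String newName(String oldName) {
        return OLD_TO_NEW.getOrDefault(oldName, oldName);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : OLD_TO_NEW.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Read against this table, the configuration in the question sets one new-style key and one old-style key, which is the kind of mixing the second answer warns about.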




Answer 4:


The new API's options are mostly undocumented.



Source: https://stackoverflow.com/questions/7457292/hadoop-options-are-not-having-any-effect-mapreduce-input-lineinputformat-linesp
