How to specify KeyValueTextInputFormat Separator in Hadoop-.20 api?

Question

In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab, which is the default, to separate key and value?

Sample input:

one,first line
two,second line

Output required:

Key : one
Value : first line
Key : two
Value : second line

I am specifying KeyValueTextInputFormat as :

    Job job = new Job(conf, "Sample");

    job.setInputFormatClass(KeyValueTextInputFormat.class);
    KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));

This is working fine for tab as a separator.

Answer 1:

In the newer API you should use the mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.

Here's an example:

Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
// next job set-up

Answer 2:

Set the following in the driver code:

conf.set("key.value.separator.in.input.line", ",");

Answer 3:

For KeyValueTextInputFormat, each input line should be a key-value pair separated by "\t" by default:

Key1     Value1,Value2

By changing the default separator, you can read the input as you wish.

For the new API, here is the solution:

//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ","); 
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);

Map

public class Map extends Mapper<Text, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    System.out.println("key---> "+key);
    System.out.println("value---> "+value.toString());
   .
   .

Output

key---> one
value---> first line
key---> two
value---> second line

Answer 4:

It's a matter of ordering.

The line conf.set("key.value.separator.in.input.line", ",") must come before you create the instance of the Job class. So:

conf.set("key.value.separator.in.input.line", ","); 
Job job = new Job(conf);
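
One way to picture why the order matters: the Job makes its own copy of the Configuration when it is created, so a value set on the original object afterwards never reaches the job. The sketch below is only an analogy in plain Java, not Hadoop code. java.util.Properties stands in for Configuration, and ConfigSnapshotSketch and JobLike are hypothetical names made up for this illustration.

```java
import java.util.Properties;

// Analogy only: Properties stands in for Hadoop's Configuration,
// and JobLike is a hypothetical stand-in for Job.
final class ConfigSnapshotSketch {

    static final String SEPARATOR_KEY = "key.value.separator.in.input.line";

    static final class JobLike {
        private final Properties snapshot = new Properties();

        JobLike(Properties conf) {
            // Copy the settings once, at construction time.
            snapshot.putAll(conf);
        }

        String separator() {
            // Fall back to tab when no separator was set before the copy.
            return snapshot.getProperty(SEPARATOR_KEY, "\t");
        }
    }

    public static void main(String[] args) {
        // Separator set BEFORE the job-like object is created: it is picked up.
        Properties early = new Properties();
        early.setProperty(SEPARATOR_KEY, ",");
        JobLike configuredFirst = new JobLike(early);

        // Separator set AFTER creation: the snapshot still has the default.
        Properties late = new Properties();
        JobLike createdFirst = new JobLike(late);
        late.setProperty(SEPARATOR_KEY, ",");

        System.out.println(",".equals(configuredFirst.separator()));
        System.out.println("\t".equals(createdFirst.separator()));
    }
}
```

If you do need to change a setting after the job exists, the new-API Job exposes getConfiguration(), which returns the job's own copy.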

Answer 5:

First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.*, you have to implement this feature yourself. For example, you can use a FileInputFormat subclass such as TextInputFormat: ignore the LongWritable key and split the Text value on the comma yourself.
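
As a rough illustration of that manual approach, the sketch below splits a line on the first occurrence of a separator using plain Java strings. The names KeyValueSplitSketch and splitKeyValue are hypothetical; inside a real Mapper you would apply something like this to value.toString(). Treating a line with no separator as a key with an empty value is an assumption, chosen to mirror the documented behaviour of KeyValueTextInputFormat.

```java
// Hypothetical helper, not part of Hadoop: splits one input line into key and value.
final class KeyValueSplitSketch {

    // Splits on the FIRST occurrence of the separator only, so any later
    // separators remain part of the value.
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: no separator means the whole line is the key
            // and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        // Sample input taken from the question.
        String[] lines = { "one,first line", "two,second line" };
        for (String line : lines) {
            String[] kv = splitKeyValue(line, ',');
            System.out.println("Key : " + kv[0]);
            System.out.println("Value : " + kv[1]);
        }
    }
}
```

Splitting on the first separator only, rather than calling String.split, keeps values such as "Value1,Value2" intact.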

Answer 6:

By default, the KeyValueTextInputFormat class uses tab as the separator between key and value in the input text file.

If you want to read input that uses a custom separator, you have to set the corresponding configuration property.

For the new Hadoop API, the property name is different:

conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");

Answer 7:

Example

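import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.PropertyConfigurator;

public class KeyValueTextInput extends Configured implements Tool {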
    public static void main(String args[]) throws Exception {
        String log4jConfPath = "log4j.properties";
        PropertyConfigurator.configure(log4jConfPath);
        int res = ToolRunner.run(new KeyValueTextInput(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();

        //conf.set("key.value.separator.in.input.line", ",");
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "WordCountSampleTemplate");
        job.setJarByClass(KeyValueTextInput.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
        fs.delete(outputPath, true);
        FileOutputFormat.setOutputPath(job, outputPath);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

class Map extends Mapper<Text, Text, Text, Text> {
    public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
        context.write(k1, v1);
    }
}

class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String sum = " || ";
        for (Text value : values)
            sum = sum + value.toString() + " || ";
        context.write(Key, new Text(sum));
    }
}

Source: https://stackoverflow.com/questions/9211151/how-to-specify-keyvaluetextinputformat-separator-in-hadoop-20-api

Tags

java

Hadoop

MapReduce