Hadoop Mapper: How can we read only the top N rows from each file in a given input path?

Submitted by 两盒软妹~ on 2020-01-06 03:18:08

Question


I am new to Hadoop. My requirement is to process only the first 10 rows of each input file. How can I exit the mapper after reading 10 rows of a file?

If anyone can provide some sample code, it would be a great help.

Thanks in advance.


Answer 1:


You can override the run method of your mapper and break out of the while loop once it has iterated 10 times. This assumes your files are not splittable; otherwise you'll get the first 10 lines of each split rather than of each file:

@Override
public void run(Context context) throws IOException, InterruptedException {
  setup(context);

  int rows = 0;
  while (context.nextKeyValue()) {
    if (rows++ == 10) {
      break; // stop after the first 10 records of this split
    }

    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }

  cleanup(context);
}
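To guarantee the cap applies to the whole file rather than to each split, a common approach is a custom input format that declares files non-splittable, so each file goes to a single mapper. A minimal sketch, assuming a TextInputFormat subclass (the class name NonSplittableTextInputFormat is illustrative; note the single-t spelling isSplitable in the Hadoop API):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each file is handed to exactly one mapper, so the run() override
// above truly yields the first 10 lines per file, not per split.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split this file
    }
}
```

Register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class). This requires the Hadoop MapReduce client libraries on the classpath.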



Answer 2:


Suppose N = 10. Then we can use the following code to read only the first 10 records from a file whose contents are:
line1
line2
.
.
.
line20

    // Mapper: emits at most the first 10 lines of its input
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class Mapcls extends Mapper<LongWritable, Text, Text, NullWritable>
    {
        @Override
        public void run(Context con) throws IOException, InterruptedException
        {
            setup(con);
            int rows = 0;
            while (con.nextKeyValue())
            {
                if (rows++ == 10)  // stop after 10 records
                {
                    break;
                }
                map(con.getCurrentKey(), con.getCurrentValue(), con);
            }

            cleanup(con);
        }

        @Override
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
        {
            con.write(value, NullWritable.get());
        }
    }


    // Driver
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Testjob extends Configured implements Tool
    {
        @Override
        public int run(String[] args) throws Exception
        {
            Configuration conf = getConf();  // use the configuration ToolRunner passed in
            Job job = Job.getInstance(conf, "Test-job");
            job.setJarByClass(getClass());

            job.setMapperClass(Mapcls.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception
        {
            int rc = ToolRunner.run(new Configuration(), new Testjob(), args);
            System.exit(rc);
        }
    }

Then the output will be (Text keys sort lexicographically, which is why line10 appears before line2):
line1
line10
line2
line3
line4
line5
line6
line7
line8
line9
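Assuming the classes above are packaged into a jar (the jar name and HDFS paths below are illustrative), the job can be launched against a Hadoop cluster as:

```shell
hadoop jar topn-job.jar Testjob /user/hadoop/input /user/hadoop/output
```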



Source: https://stackoverflow.com/questions/20009648/hadoop-mapper-how-can-we-read-only-top-n-rows-from-each-file-from-given-input
