Hadoop Mapper: How can we read only the top N rows from each file in a given input path?

Submitted by 两盒软妹~ on 2020-01-06 03:18:08

Question


I am new to Hadoop. My requirement is to process only the first 10 rows of each input file. How can I exit the mapper after reading 10 rows of a file?

If anyone can provide some sample code, it would be a great help.

Thanks in advance.


Answer 1:


You can override the run method of your mapper and break out of the while loop once it has iterated 10 times. This assumes your files are not splittable; otherwise you'll get the first 10 lines of each split rather than of each file:

@Override
public void run(Context context) throws IOException, InterruptedException {
  setup(context);

  int rows = 0;
  while (context.nextKeyValue()) {
    if (rows++ == 10) {
      break; // stop after the first 10 records of this split
    }

    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }

  cleanup(context);
}
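To guarantee the cap applies to the whole file rather than to each split, a common approach is a custom input format that declares files non-splittable, so each file goes to a single mapper. A minimal sketch, assuming a TextInputFormat subclass (the class name NonSplittableTextInputFormat is illustrative; note the single-t spelling isSplitable in the Hadoop API):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Each file is handed to exactly one mapper, so the run() override
// above truly yields the first 10 lines per file, not per split.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split this file
    }
}
```

Register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class). This requires the Hadoop MapReduce client libraries on the classpath.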



Answer 2:


Suppose N = 10. Then we can use the following code to read only the first 10 records from a file whose contents are:
line1
line2
.
.
.
line20

    // Mapper: emits at most the first 10 lines of its input
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    class Mapcls extends Mapper<LongWritable, Text, Text, NullWritable>
    {
        @Override
        public void run(Context con) throws IOException, InterruptedException
        {
            setup(con);
            int rows = 0;
            while (con.nextKeyValue())
            {
                if (rows++ == 10)  // stop after 10 records
                {
                    break;
                }
                map(con.getCurrentKey(), con.getCurrentValue(), con);
            }

            cleanup(con);
        }

        @Override
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
        {
            con.write(value, NullWritable.get());
        }
    }


    // Driver
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Testjob extends Configured implements Tool
    {
        @Override
        public int run(String[] args) throws Exception
        {
            Configuration conf = getConf();  // use the configuration ToolRunner passed in
            Job job = Job.getInstance(conf, "Test-job");
            job.setJarByClass(getClass());

            job.setMapperClass(Mapcls.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception
        {
            int rc = ToolRunner.run(new Configuration(), new Testjob(), args);
            System.exit(rc);
        }
    }

Then the output will be (Text keys sort lexicographically, which is why line10 appears before line2):
line1
line10
line2
line3
line4
line5
line6
line7
line8
line9
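Assuming the classes above are packaged into a jar (the jar name and HDFS paths below are illustrative), the job can be launched against a Hadoop cluster as:

```shell
hadoop jar topn-job.jar Testjob /user/hadoop/input /user/hadoop/output
```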



Source: https://stackoverflow.com/questions/20009648/hadoop-mapper-how-can-we-read-only-top-n-rows-from-each-file-from-given-input
