【MapReduce】MapReduce Fundamentals (Part 3)


1. Why a custom input format is needed

As we know, the NameNode stores all file metadata and keeps it entirely in memory at runtime, so the number of files HDFS can hold is limited by the NameNode's heap size.
Each block corresponds to one record in the NameNode (roughly 150 bytes per record), so a large number of small files consumes a large amount of memory; for example, ten million small files, each occupying its own block, need on the order of 10,000,000 × 150 B ≈ 1.5 GB of NameNode heap for the block records alone. At the same time, the number of map tasks is determined by the number of input splits, so running MapReduce over many small files creates many map tasks, and the task-management overhead lengthens the job. Processing a large number of small files is far slower than processing the same volume of data stored in large files, which is why Hadoop favors large files.

Although we can switch the job to CombineTextInputFormat,

  • it only packs multiple small files into a single map task at run time;
  • the files are still physically stored as many small files;
  • the pressure on the HDFS NameNode stays the same.

It is configured like this:

//when combining, small files are packed into splits according to the split size
job.setInputFormatClass(CombineTextInputFormat.class);
//set the split size (> 128 MB here)
CombineTextInputFormat.setMinInputSplitSize(job, 130*1024*1024);
FileInputFormat.addInputPath(job, new Path("/in"));

So the real fix for a large number of small files is to merge them with a MapReduce job:
combine many small files into one large file.
Analysis:

  • The default map() method is called once per line.
  • With a custom input we read an entire file at once, call map() once per file, and send the content straight to the reduce side.
  • The reduce side merges all the files.

2. Source-code analysis of the default input

First, find the input entry point. When the mapper runs, map() is by default invoked once per line, so let's start by looking at what Mapper actually does.

Mapper has three hooks plus a run() driver:

  • setup(): called once when the map task starts
  • cleanup(): called once when the map task finishes
  • map(): called once per line by default
  • run(): calls the three methods above according to its control flow

So we start the analysis from run() in the Mapper code below.

2.1 org.apache.hadoop.mapreduce.Mapper

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {    	
    /*
     * context.nextKeyValue()    -- is there another line to read?
     * context.getCurrentKey()   -- the byte offset of the current line
     * context.getCurrentValue() -- the content of the current line
     * The crucial question is where the context argument comes from,
     * i.e. who calls run(context).
     */
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
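
To make these hooks concrete, here is a minimal sketch (not part of the original walkthrough; the class name and key/value types are illustrative) of a user Mapper that overrides all three methods that run() drives:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UpperCaseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void setup(Context context) {
        // runs once per map task, before the first map() call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // runs once per line under the default TextInputFormat:
        // key is the byte offset of the line, value is the line content
        context.write(new Text(value.toString().toUpperCase()), one);
    }

    @Override
    protected void cleanup(Context context) {
        // runs once per map task, after the last map() call
    }
}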

Following run(), we find it is invoked in MapTask as mapper.run(mapperContext).
The key parts of that code are:

2.2 org.apache.hadoop.mapred.MapTask

// make the input format
//create the InputFormat instance by reflection (the key point is its type, taskContext.getInputFormatClass()); next we look at inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

//this.real=inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }
    
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
	//mapContext ----> input
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
    	//look at WrappedMapper.getMapContext to see what it returns
    	//mapperContext ----> mapContext
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
      input.initialize(split, mapperContext);
      //this mapper is the instance of the class set via job.setMapperClass()
      //trace mapperContext: it must provide nextKeyValue(), getCurrentKey() and getCurrentValue()
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

What does WrappedMapper.getMapContext return?
It returns a Context object that provides the three methods getCurrentKey(), getCurrentValue() and nextKeyValue().

2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper

/**
   * Get a wrapped {@link Mapper.Context} for custom implementations.
   * @param mapContext <code>MapContext</code> to be wrapped
   * @return a wrapped <code>Mapper.Context</code> for custom implementations
   */
  public Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context
  getMapContext(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
  	//returns a Context object, which should expose the three methods our Mapper relies on: nextKeyValue(), getCurrentKey(), getCurrentValue()
    return new Context(mapContext);
  }
  
  //the Context below does contain getCurrentKey(), getCurrentValue() and nextKeyValue()
  //all three delegate to this.mapContext, assigned in the constructor
  //tracing the mapContext argument back, it is the MapContextImpl created in MapTask, so we look at that class next
  @InterfaceStability.Evolving
  public class Context 
      extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context {

    protected MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext;

	// constructor
    public Context(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
      this.mapContext = mapContext;
    }

    /**
     * Get the input split for this map.
     */
    public InputSplit getInputSplit() {
      return mapContext.getInputSplit();
    }

    @Override
    public KEYIN getCurrentKey() throws IOException, InterruptedException {
      return mapContext.getCurrentKey();
    }

    @Override
    public VALUEIN getCurrentValue() throws IOException, InterruptedException {
      return mapContext.getCurrentValue();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      return mapContext.nextKeyValue();
    }

    @Override
    public Counter getCounter(Enum<?> counterName) {
      return mapContext.getCounter(counterName);
    }

    @Override
    public Counter getCounter(String groupName, String counterName) {
      return mapContext.getCounter(groupName, counterName);
    }

    @Override
    public OutputCommitter getOutputCommitter() {
      return mapContext.getOutputCommitter();
    }

    @Override
    public void write(KEYOUT key, VALUEOUT value) throws IOException,
        InterruptedException {
      mapContext.write(key, value);
    }

    @Override
    public String getStatus() {
      return mapContext.getStatus();
    }

    @Override
    public TaskAttemptID getTaskAttemptID() {
      return mapContext.getTaskAttemptID();
    }

    @Override
    public void setStatus(String msg) {
      mapContext.setStatus(msg);
    }

    @Override
    public Path[] getArchiveClassPaths() {
      return mapContext.getArchiveClassPaths();
    }

    @Override
    public String[] getArchiveTimestamps() {
      return mapContext.getArchiveTimestamps();
    }

    @Override
    public URI[] getCacheArchives() throws IOException {
      return mapContext.getCacheArchives();
    }

    @Override
    public URI[] getCacheFiles() throws IOException {
      return mapContext.getCacheFiles();
    }

    @Override
    public Class<? extends Reducer<?, ?, ?, ?>> getCombinerClass()
        throws ClassNotFoundException {
      return mapContext.getCombinerClass();
    }

    @Override
    public Configuration getConfiguration() {
      return mapContext.getConfiguration();
    }

    @Override
    public Path[] getFileClassPaths() {
      return mapContext.getFileClassPaths();
    }

    @Override
    public String[] getFileTimestamps() {
      return mapContext.getFileTimestamps();
    }

    @Override
    public RawComparator<?> getCombinerKeyGroupingComparator() {
      return mapContext.getCombinerKeyGroupingComparator();
    }

    @Override
    public RawComparator<?> getGroupingComparator() {
      return mapContext.getGroupingComparator();
    }

    @Override
    public Class<? extends InputFormat<?, ?>> getInputFormatClass()
        throws ClassNotFoundException {
      return mapContext.getInputFormatClass();
    }

    @Override
    public String getJar() {
      return mapContext.getJar();
    }

    @Override
    public JobID getJobID() {
      return mapContext.getJobID();
    }

    @Override
    public String getJobName() {
      return mapContext.getJobName();
    }

    @Override
    public boolean getJobSetupCleanupNeeded() {
      return mapContext.getJobSetupCleanupNeeded();
    }

    @Override
    public boolean getTaskCleanupNeeded() {
      return mapContext.getTaskCleanupNeeded();
    }

    @Override
    public Path[] getLocalCacheArchives() throws IOException {
      return mapContext.getLocalCacheArchives();
    }

    @Override
    public Path[] getLocalCacheFiles() throws IOException {
      return mapContext.getLocalCacheFiles();
    }

    @Override
    public Class<?> getMapOutputKeyClass() {
      return mapContext.getMapOutputKeyClass();
    }

    @Override
    public Class<?> getMapOutputValueClass() {
      return mapContext.getMapOutputValueClass();
    }

    @Override
    public Class<? extends Mapper<?, ?, ?, ?>> getMapperClass()
        throws ClassNotFoundException {
      return mapContext.getMapperClass();
    }

    @Override
    public int getMaxMapAttempts() {
      return mapContext.getMaxMapAttempts();
    }

    @Override
    public int getMaxReduceAttempts() {
      return mapContext.getMaxReduceAttempts();
    }

    @Override
    public int getNumReduceTasks() {
      return mapContext.getNumReduceTasks();
    }

    @Override
    public Class<? extends OutputFormat<?, ?>> getOutputFormatClass()
        throws ClassNotFoundException {
      return mapContext.getOutputFormatClass();
    }

    @Override
    public Class<?> getOutputKeyClass() {
      return mapContext.getOutputKeyClass();
    }

    @Override
    public Class<?> getOutputValueClass() {
      return mapContext.getOutputValueClass();
    }

    @Override
    public Class<? extends Partitioner<?, ?>> getPartitionerClass()
        throws ClassNotFoundException {
      return mapContext.getPartitionerClass();
    }

    @Override
    public Class<? extends Reducer<?, ?, ?, ?>> getReducerClass()
        throws ClassNotFoundException {
      return mapContext.getReducerClass();
    }

    @Override
    public RawComparator<?> getSortComparator() {
      return mapContext.getSortComparator();
    }

    @Override
    public boolean getSymlink() {
      return mapContext.getSymlink();
    }

    @Override
    public Path getWorkingDirectory() throws IOException {
      return mapContext.getWorkingDirectory();
    }

    @Override
    public void progress() {
      mapContext.progress();
    }

    @Override
    public boolean getProfileEnabled() {
      return mapContext.getProfileEnabled();
    }

    @Override
    public String getProfileParams() {
      return mapContext.getProfileParams();
    }

    @Override
    public IntegerRanges getProfileTaskRange(boolean isMap) {
      return mapContext.getProfileTaskRange(isMap);
    }

    @Override
    public String getUser() {
      return mapContext.getUser();
    }

    @Override
    public Credentials getCredentials() {
      return mapContext.getCredentials();
    }
    
    @Override
    public float getProgress() {
      return mapContext.getProgress();
    }
  }

2.4 org.apache.hadoop.mapreduce.task.MapContextImpl

In MapTask, mapContext is assigned new MapContextImpl(...).
MapContextImpl also defines getCurrentKey(), getCurrentValue() and nextKeyValue(), and all three delegate to its reader field.
The reader comes from the third constructor parameter, RecordReader<KEYIN,VALUEIN> reader,
so we go back to MapTask to see where that third argument comes from.

public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private RecordReader<KEYIN,VALUEIN> reader;
  private InputSplit split;

  public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }

  /**
   * Get the input split for this map.
   */
  public InputSplit getInputSplit() {
    return split;
  }

  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey();
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }

}
   

2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader

In MapTask, input = new NewTrackingRecordReader<INKEY,INVALUE>(split, inputFormat, reporter, taskContext),
so we look at NewTrackingRecordReader and its three methods getCurrentKey(), getCurrentValue() and nextKeyValue().
All three delegate to org.apache.hadoop.mapreduce.RecordReader<K,V> real, which is assigned as
this.real = inputFormat.createRecordReader(split, taskContext);
inputFormat is the second constructor argument of NewTrackingRecordReader,
so we go back to MapTask once more to find where inputFormat comes from.

static class NewTrackingRecordReader<K,V> 
    extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
    private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
    private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
    private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
    private final TaskReporter reporter;
    private final List<Statistics> fsStats;
    
    NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      this.real = inputFormat.createRecordReader(split, taskContext);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public void close() throws IOException {
      long bytesInPrev = getInputBytes(fsStats);
      real.close();
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public K getCurrentKey() throws IOException, InterruptedException {
      return real.getCurrentKey();
    }

    @Override
    public V getCurrentValue() throws IOException, InterruptedException {
      return real.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
      return real.getProgress();
    }

    @Override
    public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
                           org.apache.hadoop.mapreduce.TaskAttemptContext context
                           ) throws IOException, InterruptedException {
      long bytesInPrev = getInputBytes(fsStats);
      real.initialize(split, context);
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      long bytesInPrev = getInputBytes(fsStats);
      boolean result = real.nextKeyValue();
      long bytesInCurr = getInputBytes(fsStats);
      if (result) {
        inputRecordCounter.increment(1);
      }
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
      reporter.setProgress(getProgress());
      return result;
    }

    private long getInputBytes(List<Statistics> stats) {
      if (stats == null) return 0;
      long bytesRead = 0;
      for (Statistics stat: stats) {
        bytesRead = bytesRead + stat.getBytesRead();
      }
      return bytesRead;
    }
  }
  

2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass

In MapTask, inputFormat = (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>) ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job).
Looking up getInputFormatClass(), we find only the abstract declaration below, whose return type is some subclass of InputFormat,
so we continue to the implementation of getInputFormatClass() to see which InputFormat subclass is actually returned.

/**
   * Get the {@link InputFormat} class for the job.
   * 
   * @return the {@link InputFormat} class for the job.
   */
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException;

2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass

/**
   * Get the {@link InputFormat} class for the job.
   * 
   * @return the {@link InputFormat} class for the job.
   */
  @SuppressWarnings("unchecked")
  public Class<? extends InputFormat<?,?>> getInputFormatClass() 
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>) 
      //conf corresponds to job.xml
      //public static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";
      //mapred-default.xml does not define mapreduce.job.inputformat.class, so if the job did not set it either, getClass() returns the second argument, TextInputFormat.class
      //next, look at what TextInputFormat.createRecordReader returns
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }
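
The default only applies when the job has not chosen an InputFormat explicitly. As a small sketch (assuming the standard Job API; the CombineTextInputFormat used here is just an example class), an explicit choice ends up in mapreduce.job.inputformat.class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class InputFormatConfigDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        // writes mapreduce.job.inputformat.class into the job configuration,
        // so getInputFormatClass() above no longer falls back to TextInputFormat
        job.setInputFormatClass(CombineTextInputFormat.class);
        System.out.println(job.getConfiguration().get("mapreduce.job.inputformat.class"));
    }
}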
  

2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat

createRecordReader returns new LineRecordReader(recordDelimiterBytes),
so LineRecordReader is where we should find the real implementations of nextKeyValue(), getCurrentKey() and getCurrentValue().

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    final CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    if (null == codec) {
      return true;
    }
    return codec instanceof SplittableCompressionCodec;
  }

}
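
A side note on the delimiter read above: the record separator comes from textinputformat.record.delimiter, so a job can redefine what counts as one record. A small hedged example (the two-newline delimiter is only an illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CustomDelimiterDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // treat blank-line-separated paragraphs as single records instead of single lines
        conf.set("textinputformat.record.delimiter", "\n\n");
        Job job = Job.getInstance(conf);
        // ...the rest of the job setup is unchanged; LineRecordReader will now split on "\n\n"
    }
}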

2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader

This is where reading line by line is actually implemented.

public class LineRecordReader extends RecordReader<LongWritable, Text> {
  private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
  public static final String MAX_LINE_LENGTH = 
    "mapreduce.input.linerecordreader.line.maxlength";

  private long start;
  //current position: the byte offset of the start of the current line
  private long pos;
  private long end;
  private SplitLineReader in;
  private FSDataInputStream fileIn;
  private Seekable filePosition;
  private int maxLineLength;
  //the starting byte offset of each line
  private LongWritable key;
  //the content of each line
  private Text value;
  private boolean isCompressedInput;
  private Decompressor decompressor;
  private byte[] recordDelimiterBytes;

  public LineRecordReader() {
  }

  public LineRecordReader(byte[] recordDelimiter) {
    this.recordDelimiterBytes = recordDelimiter;
  }

  public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;	
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start);
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;
  }
  

  private int maxBytesToConsume(long pos) {
    return isCompressedInput
      ? Integer.MAX_VALUE
      : (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
  }

  private long getFilePosition() throws IOException {
    long retVal;
    if (isCompressedInput && null != filePosition) {
      retVal = filePosition.getPos();
    } else {
      retVal = pos;
    }
    return retVal;
  }

  private int skipUtfByteOrderMark() throws IOException {
    // Strip BOM(Byte Order Mark)
    // Text only support UTF-8, we only need to check UTF-8 BOM
    // (0xEF,0xBB,0xBF) at the start of the text stream.
    int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
        Integer.MAX_VALUE);
    int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
    // Even we read 3 extra bytes for the first line,
    // we won't alter existing behavior (no backwards incompat issue).
    // Because the newSize is less than maxLineLength and
    // the number of bytes copied to Text is always no more than newSize.
    // If the return size from readLine is not less than maxLineLength,
    // we will discard the current line and read the next line.
    pos += newSize;
    int textLength = value.getLength();
    byte[] textBytes = value.getBytes();
    if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
        (textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
      // find UTF-8 BOM, strip it.
      LOG.info("Found UTF-8 BOM and skipped it");
      textLength -= 3;
      newSize -= 3;
      if (textLength > 0) {
        // It may work to use the same buffer and not do the copyBytes
        textBytes = value.copyBytes();
        value.set(textBytes, 3, textLength);
      } else {
        value.clear();
      }
    }
    return newSize;
  }

  public boolean nextKeyValue() throws IOException {
	//if key is null, initialize it
    if (key == null) {
      key = new LongWritable();
    }
    //assign the current offset to key (pos starts at 0)
    key.set(pos);
    //pos will be advanced below by the number of bytes read, e.g. 5 for "hello"
    
    //initialize value
    if (value == null) {
      value = new Text();
    }
    //newSize counts the bytes of the current line
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
    //read one line and record its size in newSize
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
        //advance pos by the bytes just read
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    //the split has been read completely
    if (newSize == 0) {
    //release key and value so they can be garbage collected
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

  //simply returns the field:
  //the starting byte offset of the current line
  @Override
  public LongWritable getCurrentKey() {
    return key;
  }

  //the content of the current line
  @Override
  public Text getCurrentValue() {
    return value;
  }

  /**
   * Get the progress within the split
   */
  public float getProgress() throws IOException {
    if (start == end) {
      return 0.0f;
    } else {
      return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
    }
  }
  
  public synchronized void close() throws IOException {
    try {
      if (in != null) {
        in.close();
      }
    } finally {
      if (decompressor != null) {
        CodecPool.returnDecompressor(decompressor);
        decompressor = null;
      }
    }
  }
}

2.10 Source-code summary

Walking through the whole flow, the thread to follow is always the same question: where are getCurrentKey(), getCurrentValue() and nextKeyValue() really implemented?
The answer is the LineRecordReader returned by TextInputFormat.createRecordReader().
Our custom input follows exactly the same pattern.

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

The default input chain:

  • FileInputFormat
    • TextInputFormat
      • RecordReader
        • LineRecordReader

3. Custom input

Now let's build the job that merges many small files into one large file.
The custom input has three parts:

  • Create a class that extends FileInputFormat
    and override createRecordReader().
  • Create the actual file reader by extending RecordReader
    and override getCurrentKey(), getCurrentValue() and nextKeyValue().
  • Register the custom input class on the job:
    job.setInputFormatClass(MyFileInputFormat.class);

Requirement:
merge the three small files 1.txt, 2.txt and 3.txt
into one large file.

MyFileInputFormat.java

/**
 * The type parameters are the key/value types this input format produces,
 * which become the mapper's input types.
 *
 * Each read consumes one whole file:
 * 	the file content goes into a Text value;
 * 	the key is not needed here, so it can be NullWritable
 */
public class MyFileInputFormat extends FileInputFormat<NullWritable, Text> {
    /**
     * Create the file reader.
     *
     * The input paths come from FileInputFormat.addInputPath(job, ...) via the job configuration;
     * the framework hands this method one (InputSplit split, TaskAttemptContext context) per split.
     */
    public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        MyRecordReader mr = new MyRecordReader();
        //pass the split and context to the reader (MapTask also calls initialize() again through NewTrackingRecordReader, see 2.5)
        mr.initialize(split,context);
        return mr;
    }
}
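
One optional refinement, not part of the original MyFileInputFormat (shown here as a hedged sketch): FileInputFormat is allowed to split a large file across several splits, in which case a reader that opens the file from the beginning would not match its split. Overriding isSplitable() to return false guarantees that every input file becomes exactly one split and therefore exactly one map() call. For the small files in this example each file fits in a single split anyway, so this is a safety net rather than a required change; it would go inside MyFileInputFormat (imports: org.apache.hadoop.mapreduce.JobContext, org.apache.hadoop.fs.Path).

    //optional safety net: never split a single input file,
    //so each file is read whole by one MyRecordReader
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }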

MyRecordReader.java

/**
 * The file reader: this is where the actual reading happens.
 * It opens an HDFS input stream:
 * 		FileSystem fs
 * 		fs.open(path)
 * Note:
 * 	for each record the framework first calls nextKeyValue() to check whether there is content left, and only then calls getCurrentKey() and getCurrentValue()
 */
public class MyRecordReader extends RecordReader<NullWritable, Text> {
    FileSystem fs;
    int length;
    FSDataInputStream fsDataInputStream;
    Text value = new Text();
    //whether the file has been read yet: false (default) = not read, true = finished
    boolean isReader;


    /**
     * Initialization: open the HDFS input stream.
     * @param split the input split
     * @param context the task context
     * @throws IOException
     * @throws InterruptedException
     */
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        //initialize the FileSystem; context.getConfiguration() provides the job configuration
        fs = FileSystem.get(context.getConfiguration());
        //get the file path
        FileSplit fsplit = (FileSplit) split;
        Path path = fsplit.getPath();
        //get the actual length of the file
        length = (int) fsplit.getLength();
        //open the input stream
        fsDataInputStream = fs.open(path);
    }

    /**
     * Check whether this file split still has content to read.
     * @return true if a record (here: the whole file) was read, false once the file has been consumed
     * @throws IOException
     * @throws InterruptedException
     */
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!isReader) { //the file has not been read yet
            //buffer for the whole file content
            byte[] buf = new byte[length];
            fsDataInputStream.readFully(buf, 0, length);
            //put the file content into value
            value.set(buf);
            //mark the file as read: we consume the whole file in a single call
            isReader = true;
            return true;
        }else {
            return false;
        }
    }

    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    public Text getCurrentValue() throws IOException, InterruptedException {
        return this.value;
    }

    public float getProgress() throws IOException, InterruptedException {
        //the file is either fully read or not read at all
        return isReader ? 1.0f : 0.0f;
    }

    /**
     * Close the streams.
     * @throws IOException
     */
    public void close() throws IOException {
        if(fsDataInputStream!=null){
            fsDataInputStream.close();
        }
        if(fs!=null){
            fs.close();
        }
    }
}

MergeFiles.java

/**
 * The default map() is called once per line.
 * With our custom input, one read consumes a whole file, map() is called once per file, and the content is sent straight to the reduce side.
 * The reduce side merges all the files.
 */
public class MergeFiles {


    static class MergeFilesMapper extends Mapper<NullWritable, Text, Text, NullWritable> {
        //called once per file
        @Override
        protected void map(NullWritable key, Text value, Context context) throws IOException, InterruptedException {
            //send the whole file content straight to the reduce side
            context.write(value, NullWritable.get());
        }
    }

    static class MergeFilesReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(key,NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.setProperty("HADOOP_USER_NAME","hdp01");
        Configuration conf = new Configuration();
        conf.set("mapperduce.framework.name","local");
        conf.set("fs.defaultFS","hdfs://10.211.55.20:9000");

        Job job = Job.getInstance(conf);

        job.setJarByClass(MergeFiles.class);
        job.setMapperClass(MergeFilesMapper.class);
        job.setReducerClass(MergeFilesReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        //register the custom input format
        job.setInputFormatClass(MyFileInputFormat.class);

        FileInputFormat.addInputPath(job,new Path("/tmpin/invetedIndex"));

        FileSystem fs= FileSystem.get(conf);
        Path outPath = new Path("/tmpout/mergeFiles");
        if(fs.exists(outPath)){ //delete the output path if it already exists
            fs.delete(outPath,true);
        }
        FileOutputFormat.setOutputPath(job,outPath);

        job.waitForCompletion(true);

    }
}
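
Because each file's whole content becomes a single map output key, the merged result only lands in one output file when the job runs with a single reducer. The MapReduce default is already one reduce task, but it does no harm to state it explicitly; a one-line, optional addition to main() above (an assumption about the desired output layout, not something the original code sets):

        //optional: force a single reducer so everything ends up in one part-r-00000 file
        job.setNumReduceTasks(1);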

Input files:

[hdp01@hdp01 tmpfiles]$ cat 1.txt 
A friend in need is a friend indeed
Good is good but better carries it
[hdp01@hdp01 tmpfiles]$ cat 2.txt 
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
[hdp01@hdp01 tmpfiles]$ cat 3.txt 
Doubt is the key to knowledge

Output file:

[hdp01@hdp01 tmpfiles]$ hdfs dfs -cat /tmpout/mergeFiles/part-r-00000
A friend in need is a friend indeed
Good is good but better carries it

A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple

Doubt is the key to knowledge