Table of Contents
- 1. Why a custom input format is needed
- 2. Source code analysis of the default input
- 2.1 org.apache.hadoop.mapreduce.Mapper
- 2.2 org.apache.hadoop.mapred.MapTask
- 2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper
- 2.4 org.apache.hadoop.mapreduce.task.MapContextImpl
- 2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader
- 2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass
- 2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass
- 2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat
- 2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader
- 2.10 Source code summary
- 3. Custom input
1. Why a custom input format is needed
As we know, the NameNode stores file metadata and keeps all of it in memory at runtime, so the number of files HDFS can hold is limited by the NameNode's memory.
Each block corresponds to one record in the NameNode (roughly 150 bytes per block), so a huge number of small files consumes a lot of memory. At the same time, the number of map tasks is determined by the number of input splits, so processing many small files with MapReduce spawns too many map tasks, and the task-management overhead inflates the job time. Processing many small files is far slower than processing the same volume of data stored in a few large files. This is why Hadoop recommends storing large files.
Although we can switch to CombineTextInputFormat in the code:
- it only packs multiple small files into one map task at runtime,
- the physical storage is still a large number of small files,
- and the pressure on the HDFS NameNode remains the same.
It is configured like this:
//combine files into splits according to the split size
job.setInputFormatClass(CombineTextInputFormat.class);
//set the split size (> 128 MB)
CombineTextInputFormat.setMinInputSplitSize(job, 130*1024*1024);
FileInputFormat.addInputPath(job, new Path("/in"));
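In practice, the knob usually tuned together with CombineTextInputFormat is the maximum split size, which caps how much data (and therefore how many small files) gets packed into one combined split. A minimal sketch, assuming the setMaxInputSplitSize helper inherited from FileInputFormat and an illustrative 4 MB limit:
//hypothetical alternative configuration: cap each combined split at roughly 4 MB
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);
FileInputFormat.addInputPath(job, new Path("/in"));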
So the real solution for many small files is to merge them with a MapReduce job:
merge many small files into one large file.
Analysis:
- the default map() method is called once per line;
- our custom input should read an entire file at once, so map() is called once per file and the content is sent straight to the reduce side;
- the reduce side then merges all the files together.
2. Source code analysis of the default input
First, let's find the input entry point. When a mapper runs, by default it is invoked once per line, so let's see what Mapper actually does.
Mapper defines three callback methods plus run():
- setup(): executed once when the map task starts
- cleanup(): executed once when the map task ends
- map(): by default, executed once per line
- run(): drives the three methods above
So the analysis below starts from run() in Mapper.
2.1 org.apache.hadoop.mapreduce.Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
/**
* The <code>Context</code> passed on to the {@link Mapper} implementations.
*/
public abstract class Context
implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}
/**
* Called once at the beginning of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
/**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
/*
* context.nextKeyValue()    checks whether there is another line
* context.getCurrentKey()   returns the current offset
* context.getCurrentValue() returns the content of the current line
* The key question is where the context argument comes from,
* i.e. who calls run(context).
*/
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
}
Clicking into run(), we find that it is called in MapTask as mapper.run(mapperContext).
The key part of that code is shown below:
2.2 org.apache.hadoop.mapred.MapTask
// make the input format
//created via reflection; what matters is the class returned by taskContext.getInputFormatClass(),
//after which we follow inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
(org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
//this.real=inputFormat.createRecordReader(split, taskContext)
org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
new NewTrackingRecordReader<INKEY,INVALUE>
(split, inputFormat, reporter, taskContext);
job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
org.apache.hadoop.mapreduce.RecordWriter output = null;
// get an output object
if (job.getNumReduceTasks() == 0) {
output =
new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
//mapContext ---> input
mapContext =
new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
input, output,
committer,
reporter, split);
org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
//see what WrappedMapper.getMapContext returns
//mapperContext ---> mapContext
mapperContext =
new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
mapContext);
try {
input.initialize(split, mapperContext);
//this mapper is the instance of the class set via job.setMapperClass()
//trace mapperContext: it must provide nextKeyValue(), getCurrentKey(), getCurrentValue()
mapper.run(mapperContext);
mapPhase.complete();
setPhase(TaskStatus.Phase.SORT);
statusUpdate(umbilical);
input.close();
input = null;
output.close(mapperContext);
output = null;
} finally {
closeQuietly(input);
closeQuietly(output, mapperContext);
}
}
So what does WrappedMapper.getMapContext return?
It returns a Context object, and that Context provides the three methods getCurrentKey(), getCurrentValue() and nextKeyValue().
2.3 org.apache.hadoop.mapreduce.lib.map.WrappedMapper
/**
* Get a wrapped {@link Mapper.Context} for custom implementations.
* @param mapContext <code>MapContext</code> to be wrapped
* @return a wrapped <code>Mapper.Context</code> for custom implementations
*/
public Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context
getMapContext(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
//returns a Context object, which must provide the three methods used in Mapper:
//nextKeyValue(), getCurrentKey(), getCurrentValue()
return new Context(mapContext);
}
//the Context below does implement getCurrentKey(), getCurrentValue() and nextKeyValue();
//all three delegate to this.mapContext, assigned in the constructor, and that mapContext
//argument traces back to the MapContextImpl created in MapTask, so that class is next
@InterfaceStability.Evolving
public class Context
extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context {
protected MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext;
// constructor
public Context(MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> mapContext) {
this.mapContext = mapContext;
}
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit() {
return mapContext.getInputSplit();
}
@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
return mapContext.getCurrentKey();
}
@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
return mapContext.getCurrentValue();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return mapContext.nextKeyValue();
}
@Override
public Counter getCounter(Enum<?> counterName) {
return mapContext.getCounter(counterName);
}
@Override
public Counter getCounter(String groupName, String counterName) {
return mapContext.getCounter(groupName, counterName);
}
@Override
public OutputCommitter getOutputCommitter() {
return mapContext.getOutputCommitter();
}
@Override
public void write(KEYOUT key, VALUEOUT value) throws IOException,
InterruptedException {
mapContext.write(key, value);
}
@Override
public String getStatus() {
return mapContext.getStatus();
}
@Override
public TaskAttemptID getTaskAttemptID() {
return mapContext.getTaskAttemptID();
}
@Override
public void setStatus(String msg) {
mapContext.setStatus(msg);
}
@Override
public Path[] getArchiveClassPaths() {
return mapContext.getArchiveClassPaths();
}
@Override
public String[] getArchiveTimestamps() {
return mapContext.getArchiveTimestamps();
}
@Override
public URI[] getCacheArchives() throws IOException {
return mapContext.getCacheArchives();
}
@Override
public URI[] getCacheFiles() throws IOException {
return mapContext.getCacheFiles();
}
@Override
public Class<? extends Reducer<?, ?, ?, ?>> getCombinerClass()
throws ClassNotFoundException {
return mapContext.getCombinerClass();
}
@Override
public Configuration getConfiguration() {
return mapContext.getConfiguration();
}
@Override
public Path[] getFileClassPaths() {
return mapContext.getFileClassPaths();
}
@Override
public String[] getFileTimestamps() {
return mapContext.getFileTimestamps();
}
@Override
public RawComparator<?> getCombinerKeyGroupingComparator() {
return mapContext.getCombinerKeyGroupingComparator();
}
@Override
public RawComparator<?> getGroupingComparator() {
return mapContext.getGroupingComparator();
}
@Override
public Class<? extends InputFormat<?, ?>> getInputFormatClass()
throws ClassNotFoundException {
return mapContext.getInputFormatClass();
}
@Override
public String getJar() {
return mapContext.getJar();
}
@Override
public JobID getJobID() {
return mapContext.getJobID();
}
@Override
public String getJobName() {
return mapContext.getJobName();
}
@Override
public boolean getJobSetupCleanupNeeded() {
return mapContext.getJobSetupCleanupNeeded();
}
@Override
public boolean getTaskCleanupNeeded() {
return mapContext.getTaskCleanupNeeded();
}
@Override
public Path[] getLocalCacheArchives() throws IOException {
return mapContext.getLocalCacheArchives();
}
@Override
public Path[] getLocalCacheFiles() throws IOException {
return mapContext.getLocalCacheFiles();
}
@Override
public Class<?> getMapOutputKeyClass() {
return mapContext.getMapOutputKeyClass();
}
@Override
public Class<?> getMapOutputValueClass() {
return mapContext.getMapOutputValueClass();
}
@Override
public Class<? extends Mapper<?, ?, ?, ?>> getMapperClass()
throws ClassNotFoundException {
return mapContext.getMapperClass();
}
@Override
public int getMaxMapAttempts() {
return mapContext.getMaxMapAttempts();
}
@Override
public int getMaxReduceAttempts() {
return mapContext.getMaxReduceAttempts();
}
@Override
public int getNumReduceTasks() {
return mapContext.getNumReduceTasks();
}
@Override
public Class<? extends OutputFormat<?, ?>> getOutputFormatClass()
throws ClassNotFoundException {
return mapContext.getOutputFormatClass();
}
@Override
public Class<?> getOutputKeyClass() {
return mapContext.getOutputKeyClass();
}
@Override
public Class<?> getOutputValueClass() {
return mapContext.getOutputValueClass();
}
@Override
public Class<? extends Partitioner<?, ?>> getPartitionerClass()
throws ClassNotFoundException {
return mapContext.getPartitionerClass();
}
@Override
public Class<? extends Reducer<?, ?, ?, ?>> getReducerClass()
throws ClassNotFoundException {
return mapContext.getReducerClass();
}
@Override
public RawComparator<?> getSortComparator() {
return mapContext.getSortComparator();
}
@Override
public boolean getSymlink() {
return mapContext.getSymlink();
}
@Override
public Path getWorkingDirectory() throws IOException {
return mapContext.getWorkingDirectory();
}
@Override
public void progress() {
mapContext.progress();
}
@Override
public boolean getProfileEnabled() {
return mapContext.getProfileEnabled();
}
@Override
public String getProfileParams() {
return mapContext.getProfileParams();
}
@Override
public IntegerRanges getProfileTaskRange(boolean isMap) {
return mapContext.getProfileTaskRange(isMap);
}
@Override
public String getUser() {
return mapContext.getUser();
}
@Override
public Credentials getCredentials() {
return mapContext.getCredentials();
}
@Override
public float getProgress() {
return mapContext.getProgress();
}
}
2.4 org.apache.hadoop.mapreduce.task.MapContextImpl
In MapTask, mapContext is assigned new MapContextImpl(...).
MapContextImpl also has getCurrentKey(), getCurrentValue() and nextKeyValue(), and their return values all come from the reader field.
reader is the third parameter of the MapContextImpl constructor, RecordReader<KEYIN,VALUEIN> reader,
so we go back to MapTask to see where that third argument actually comes from.
public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
private RecordReader<KEYIN,VALUEIN> reader;
private InputSplit split;
public MapContextImpl(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer,
StatusReporter reporter,
InputSplit split) {
super(conf, taskid, writer, committer, reporter);
this.reader = reader;
this.split = split;
}
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit() {
return split;
}
@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
return reader.getCurrentKey();
}
@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
return reader.getCurrentValue();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return reader.nextKeyValue();
}
}
2.5 org.apache.hadoop.mapred.MapTask.NewTrackingRecordReader
In MapTask, input =
new NewTrackingRecordReader<INKEY,INVALUE>
(split, inputFormat, reporter, taskContext);
so we look at the getCurrentKey(), getCurrentValue() and nextKeyValue() methods of NewTrackingRecordReader.
All three delegate to org.apache.hadoop.mapreduce.RecordReader<K,V> real,
which is assigned by this.real = inputFormat.createRecordReader(split, taskContext);
and inputFormat is the second parameter of the NewTrackingRecordReader constructor,
so we return to MapTask once more to find where inputFormat comes from.
static class NewTrackingRecordReader<K,V>
extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
private final TaskReporter reporter;
private final List<Statistics> fsStats;
NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
TaskReporter reporter,
org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
throws InterruptedException, IOException {
this.reporter = reporter;
this.inputRecordCounter = reporter
.getCounter(TaskCounter.MAP_INPUT_RECORDS);
this.fileInputByteCounter = reporter
.getCounter(FileInputFormatCounter.BYTES_READ);
List <Statistics> matchedStats = null;
if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
.getPath(), taskContext.getConfiguration());
}
fsStats = matchedStats;
long bytesInPrev = getInputBytes(fsStats);
this.real = inputFormat.createRecordReader(split, taskContext);
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public void close() throws IOException {
long bytesInPrev = getInputBytes(fsStats);
real.close();
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public K getCurrentKey() throws IOException, InterruptedException {
return real.getCurrentKey();
}
@Override
public V getCurrentValue() throws IOException, InterruptedException {
return real.getCurrentValue();
}
@Override
public float getProgress() throws IOException, InterruptedException {
return real.getProgress();
}
@Override
public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context
) throws IOException, InterruptedException {
long bytesInPrev = getInputBytes(fsStats);
real.initialize(split, context);
long bytesInCurr = getInputBytes(fsStats);
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
long bytesInPrev = getInputBytes(fsStats);
boolean result = real.nextKeyValue();
long bytesInCurr = getInputBytes(fsStats);
if (result) {
inputRecordCounter.increment(1);
}
fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
reporter.setProgress(getProgress());
return result;
}
private long getInputBytes(List<Statistics> stats) {
if (stats == null) return 0;
long bytesRead = 0;
for (Statistics stat: stats) {
bytesRead = bytesRead + stat.getBytesRead();
}
return bytesRead;
}
}
2.6 org.apache.hadoop.mapreduce.JobContext#getInputFormatClass
In MapTask,
inputFormat =
(org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
getInputFormatClass() turns out to be only an abstract declaration here, returning some subclass of InputFormat,
so we continue to the implementation of getInputFormatClass() to see which InputFormat subclass it returns.
/**
* Get the {@link InputFormat} class for the job.
*
* @return the {@link InputFormat} class for the job.
*/
public Class<? extends InputFormat<?,?>> getInputFormatClass()
throws ClassNotFoundException;
2.7 org.apache.hadoop.mapreduce.task.JobContextImpl#getInputFormatClass
/**
* Get the {@link InputFormat} class for the job.
*
* @return the {@link InputFormat} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends InputFormat<?,?>> getInputFormatClass()
throws ClassNotFoundException {
return (Class<? extends InputFormat<?,?>>)
//conf corresponds to job.xml
//public static final String INPUT_FORMAT_CLASS_ATTR = "mapreduce.job.inputformat.class";
//mapred-default.xml does not define mapreduce.job.inputformat.class, so getClass()
//falls back to the second argument, TextInputFormat.class
//next, check what TextInputFormat.createRecordReader() returns
conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
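In other words, TextInputFormat is only the fallback because nothing has set mapreduce.job.inputformat.class. Calling job.setInputFormatClass() is what writes that key into the job configuration, which is how the custom format in section 3 takes effect. A minimal sketch (MyFileInputFormat is the class defined in section 3):
//setInputFormatClass() stores the class under mapreduce.job.inputformat.class,
//which JobContextImpl.getInputFormatClass() later reads back
Job job = Job.getInstance(new Configuration());
job.setInputFormatClass(MyFileInputFormat.class);
//conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class) now returns MyFileInputFormat.class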
2.8 org.apache.hadoop.mapreduce.lib.input.TextInputFormat
We can see that createRecordReader() returns new LineRecordReader(recordDelimiterBytes),
so LineRecordReader should contain the real implementation of nextKeyValue(), getCurrentKey() and getCurrentValue().
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
@Override
public RecordReader<LongWritable, Text>
createRecordReader(InputSplit split,
TaskAttemptContext context) {
String delimiter = context.getConfiguration().get(
"textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
return new LineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
final CompressionCodec codec =
new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
if (null == codec) {
return true;
}
return codec instanceof SplittableCompressionCodec;
}
}
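A side note: the textinputformat.record.delimiter property read above means the record separator can be changed purely through configuration, without any custom InputFormat. A small sketch, assuming blank-line-separated records just as an example:
//illustrative only: treat a blank line as the record delimiter instead of '\n'
Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "\n\n");
Job job = Job.getInstance(conf);
job.setInputFormatClass(TextInputFormat.class);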
2.9 org.apache.hadoop.mapreduce.lib.input.LineRecordReader
This is where reading line by line is actually implemented. For example, for a split containing "hello\nworld", the first call to nextKeyValue() yields key=0, value="hello", and the second yields key=6, value="world".
public class LineRecordReader extends RecordReader<LongWritable, Text> {
private static final Log LOG = LogFactory.getLog(LineRecordReader.class);
public static final String MAX_LINE_LENGTH =
"mapreduce.input.linerecordreader.line.maxlength";
private long start;
//position: the byte offset of the start of the current line
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
//the starting offset of each line
private LongWritable key;
//the content of each line
private Text value;
private boolean isCompressedInput;
private Decompressor decompressor;
private byte[] recordDelimiterBytes;
public LineRecordReader() {
}
public LineRecordReader(byte[] recordDelimiter) {
this.recordDelimiterBytes = recordDelimiter;
}
public void initialize(InputSplit genericSplit,
TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
// open the file and seek to the start of the split
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
if (null!=codec) {
isCompressedInput = true;
decompressor = CodecPool.getDecompressor(codec);
if (codec instanceof SplittableCompressionCodec) {
final SplitCompressionInputStream cIn =
((SplittableCompressionCodec)codec).createInputStream(
fileIn, decompressor, start, end,
SplittableCompressionCodec.READ_MODE.BYBLOCK);
in = new CompressedSplitLineReader(cIn, job,
this.recordDelimiterBytes);
start = cIn.getAdjustedStart();
end = cIn.getAdjustedEnd();
filePosition = cIn;
} else {
in = new SplitLineReader(codec.createInputStream(fileIn,
decompressor), job, this.recordDelimiterBytes);
filePosition = fileIn;
}
} else {
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength());
filePosition = fileIn;
}
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
private int maxBytesToConsume(long pos) {
return isCompressedInput
? Integer.MAX_VALUE
: (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
}
private long getFilePosition() throws IOException {
long retVal;
if (isCompressedInput && null != filePosition) {
retVal = filePosition.getPos();
} else {
retVal = pos;
}
return retVal;
}
private int skipUtfByteOrderMark() throws IOException {
// Strip BOM(Byte Order Mark)
// Text only support UTF-8, we only need to check UTF-8 BOM
// (0xEF,0xBB,0xBF) at the start of the text stream.
int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
Integer.MAX_VALUE);
int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
// Even we read 3 extra bytes for the first line,
// we won't alter existing behavior (no backwards incompat issue).
// Because the newSize is less than maxLineLength and
// the number of bytes copied to Text is always no more than newSize.
// If the return size from readLine is not less than maxLineLength,
// we will discard the current line and read the next line.
pos += newSize;
int textLength = value.getLength();
byte[] textBytes = value.getBytes();
if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
(textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
// find UTF-8 BOM, strip it.
LOG.info("Found UTF-8 BOM and skipped it");
textLength -= 3;
newSize -= 3;
if (textLength > 0) {
// It may work to use the same buffer and not do the copyBytes
textBytes = value.copyBytes();
value.set(textBytes, 3, textLength);
} else {
value.clear();
}
}
return newSize;
}
public boolean nextKeyValue() throws IOException {
//initialize key lazily if it is null
if (key == null) {
key = new LongWritable();
}
//assign the current offset to key (pos starts at 0)
key.set(pos);
//pos is later advanced by the number of bytes consumed for the line
//initialize value lazily
if (value == null) {
value = new Text();
}
//number of bytes read for the current line
int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
//assign newSize
if (pos == 0) {
newSize = skipUtfByteOrderMark();
} else {
newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
//advance pos
pos += newSize;
}
if ((newSize == 0) || (newSize < maxLineLength)) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
//the split has been fully consumed
if (newSize == 0) {
//set to null to help garbage collection
key = null;
value = null;
return false;
} else {
return true;
}
}
//simply returns the field,
//i.e. the starting offset of the current line
@Override
public LongWritable getCurrentKey() {
return key;
}
//the content of the current line
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() throws IOException {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {
if (decompressor != null) {
CodecPool.returnDecompressor(decompressor);
decompressor = null;
}
}
}
}
2.10 Source code summary
The thread running through this whole walk-through is to keep asking which class actually implements getCurrentKey(), getCurrentValue() and nextKeyValue().
The chain is mapperContext (WrappedMapper.Context) -> mapContext (MapContextImpl) -> input (NewTrackingRecordReader) -> real, and real is the LineRecordReader returned by TextInputFormat.createRecordReader(), which is where the three methods are finally implemented.
A custom input format follows exactly the same pattern.
public void run(Context context) throws IOException, InterruptedException {
setup(context);
try {
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
} finally {
cleanup(context);
}
}
The default input classes:
- InputFormat: FileInputFormat -> TextInputFormat
- RecordReader: RecordReader -> LineRecordReader
3. Custom input
Now let's apply the analysis and merge many small files into one large file.
Custom input steps:
- create a class that extends FileInputFormat and override createRecordReader();
- create the actual file reader by extending RecordReader and overriding getCurrentKey(), getCurrentValue() and nextKeyValue();
- register the custom input class on the job:
job.setInputFormatClass(MyFileInputFormat.class);
Requirement:
merge three small files, 1.txt, 2.txt and 3.txt,
into one large file.
MyFileInputFormat.java
/**
* The generic parameters are the key/value types this format produces,
* i.e. the mapper's input types.
*
* Each read consumes one whole file:
* the file content becomes a Text value,
* and the key can simply be NullWritable.
*/
public class MyFileInputFormat extends FileInputFormat<NullWritable, Text> {
/**
* Returns the file reader for a split.
*
* Input paths come from FileInputFormat.addInputPath(job, ...) via job.xml.
*
* Parameters: InputSplit split, TaskAttemptContext context
*/
public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
MyRecordReader mr = new MyRecordReader();
//pass the split and context; note the framework calls initialize() again through
//NewTrackingRecordReader.initialize(), so this explicit call is redundant but harmless
mr.initialize(split,context);
return mr;
}
}
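One refinement worth considering (not part of the original code): since this reader is meant to consume an entire file as a single record, whole-file input formats usually also override isSplitable() so that a file larger than one block is never split across map tasks. A minimal sketch of the extra method for MyFileInputFormat, under that assumption:
//suggested addition: keep each file in a single split so MyRecordReader always sees the whole file
@Override
protected boolean isSplitable(JobContext context, Path filename) {
    return false;
}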
MyRecordReader.java
/**
* The file reader that does the actual reading.
* It opens an HDFS input stream:
* FileSystem fs
* fs.open(path)
* Note:
* when a split is processed, nextKeyValue() is called first to check whether there is
* anything left to read, and only then are getCurrentKey() and getCurrentValue() called.
*/
public class MyRecordReader extends RecordReader<NullWritable, Text> {
FileSystem fs;
int lenth;
FSDataInputStream fsDataInputStream;
Text value=new Text();
//flag marking whether the file has already been read; defaults to false (true = already read, false = not yet)
boolean isReader;
/**
* Initialization: create the HDFS input stream.
* @param split the input split
* @param context the task context
* @throws IOException
* @throws InterruptedException
*/
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
//initialize the FileSystem object; context.getConfiguration() returns the job configuration
fs = FileSystem.get(context.getConfiguration());
//get the file path
FileSplit fsplit = (FileSplit) split;
Path path = fsplit.getPath();
//get the actual file length
lenth = (int) fsplit.getLength();
//open the input stream
fsDataInputStream = fs.open(path);
}
/**
* Checks whether the current split still has content to read.
* @return true if a key/value pair was read and is available, false once the file has been consumed
* @throws IOException
* @throws InterruptedException
*/
public boolean nextKeyValue() throws IOException, InterruptedException {
if(!isReader){//the file has not been read yet
//buffer for the whole file content
byte[] buf = new byte[lenth];
fsDataInputStream.readFully(buf,0,lenth);
//put the content into value
value.set(buf);
//mark the file as read: the whole file is consumed in one call, so set the flag to true
isReader = true;
return true;
}else {
return false;
}
}
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
public Text getCurrentValue() throws IOException, InterruptedException {
return this.value;
}
public float getProgress() throws IOException, InterruptedException {
//the file is either fully read or not read at all
return isReader?1.0f:0.0f;
}
/**
* Close the streams.
* @throws IOException
*/
public void close() throws IOException {
if(fsDataInputStream!=null){
fsDataInputStream.close();
}
if(fs!=null){
fs.close();
}
}
}
MergeFiles.java
/**
* The default map() method is called once per line.
* With the custom input, one read consumes an entire file, so map() is called once per file
* and the content is sent straight to the reduce side.
* The reduce side then merges all the files.
*/
public class MergeFiles {
static class MergeFilesMapper extends Mapper<NullWritable, Text, Text, NullWritable> {
//called once per file
@Override
protected void map(NullWritable key, Text value, Context context) throws IOException, InterruptedException {
//send the file content straight to the reduce side
context.write(value, NullWritable.get());
}
}
static class MergeFilesReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
for (NullWritable v : values) {
context.write(key,NullWritable.get());
}
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
System.setProperty("HADOOP_USER_NAME","hdp01");
Configuration conf = new Configuration();
conf.set("mapperduce.framework.name","local");
conf.set("fs.defaultFS","hdfs://10.211.55.20:9000");
Job job = Job.getInstance(conf);
job.setJarByClass(MergeFiles.class);
job.setMapperClass(MergeFilesMapper.class);
job.setReducerClass(MergeFilesReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
//register the custom input format class
job.setInputFormatClass(MyFileInputFormat.class);
FileInputFormat.addInputPath(job,new Path("/tmpin/invetedIndex"));
FileSystem fs= FileSystem.get(conf);
Path outPath = new Path("/tmpout/mergeFiles");
if(fs.exists(outPath)){//delete the output path if it already exists
fs.delete(outPath,true);
}
FileOutputFormat.setOutputPath(job,outPath);
job.waitForCompletion(true);
}
}
Input files:
[hdp01@hdp01 tmpfiles]$ cat 1.txt
A friend in need is a friend indeed
Good is good but better carries it
[hdp01@hdp01 tmpfiles]$ cat 2.txt
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
[hdp01@hdp01 tmpfiles]$ cat 3.txt
Doubt is the key to knowledge
Output file:
[hdp01@hdp01 tmpfiles]$ hdfs dfs -cat /tmpout/mergeFiles/part-r-00000
A friend in need is a friend indeed
Good is good but better carries it
A good name is better than riches
Time is a bird for ever on the wing
Adversity is a good disciple
Doubt is the key to knowledge
Source: CSDN
Author: 霁泽Coding
Link: https://blog.csdn.net/jiajane/article/details/103536568