How to get the input file name in the mapper in a Hadoop program?

粉色の甜心 2020-11-29 18:48

How can I get the name of the input file within a mapper? I have multiple input files stored in the input directory, each mapper may read a different file, and I need to kno

10 Answers
  •  温柔的废话
    2020-11-29 19:09

    The answers that advocate casting to FileSplit no longer work when multiple inputs are configured: FileSplit instances are no longer returned in that case, so the cast throws a ClassCastException. Instead, org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit instances are returned. Unfortunately, the TaggedInputSplit class is not accessible without reflection, so here is a utility class I wrote for this. Just call:

    Path path = MapperUtils.getPath(context.getInputSplit());
    

    in your Mapper.setup(Context context) method.

    Here is the source code for my MapperUtils class:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    
    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;
    import java.lang.reflect.Method;
    import java.util.Optional;
    
    public class MapperUtils {
    
        public static Path getPath(InputSplit split) {
            return getFileSplit(split).map(FileSplit::getPath).orElseThrow(() -> 
                new AssertionError("cannot find path from split " + split.getClass()));
        }
    
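        // Unwraps nested TaggedInputSplit instances until a FileSplit is found; empty if there is none.
        public static Optional<FileSplit> getFileSplit(InputSplit split) {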
            if (split instanceof FileSplit) {
                return Optional.of((FileSplit)split);
            } else if (TaggedInputSplit.clazz.isInstance(split)) {
                return getFileSplit(TaggedInputSplit.getInputSplit(split));
            } else {
                return Optional.empty();
            }
        }
    
        // Reflective access to Hadoop's non-public TaggedInputSplit, resolved once at class-load time.
        private static final class TaggedInputSplit {
            private static final Class<?> clazz;
            private static final MethodHandle method;
    
            static {
                try {
                    clazz = Class.forName("org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit");
                    Method m = clazz.getDeclaredMethod("getInputSplit");
                    m.setAccessible(true);
                    method = MethodHandles.lookup().unreflect(m).asType(
                        MethodType.methodType(InputSplit.class, InputSplit.class));
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(e);
                }
            }
    
            static InputSplit getInputSplit(InputSplit o) {
                try {
                    return (InputSplit) method.invokeExact(o);
                } catch (Throwable e) {
                    throw new AssertionError(e);
                }
            }
        }
    
        private MapperUtils() { }
    
    }
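
The unwrapping logic above can be tried without Hadoop on the classpath. The following is only a minimal sketch of the same pattern (look the wrapper class up by name once, turn its hidden accessor into a MethodHandle, then unwrap recursively). Split, FileLikeSplit, TaggedLikeSplit and SplitUnwrapSketch are hypothetical stand-ins invented for this illustration; they are not Hadoop classes.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;
import java.util.Optional;

final class SplitUnwrapSketch {

    // Hypothetical stand-in for Hadoop's InputSplit.
    abstract static class Split { }

    // Hypothetical stand-in for FileSplit: the only split that knows its path.
    static final class FileLikeSplit extends Split {
        private final String path;
        FileLikeSplit(String path) { this.path = path; }
        String getPath() { return path; }
    }

    // Hypothetical stand-in for TaggedInputSplit: wraps another split and
    // exposes it only through a non-public accessor.
    static final class TaggedLikeSplit extends Split {
        private final Split inner;
        TaggedLikeSplit(Split inner) { this.inner = inner; }
        private Split getInputSplit() { return inner; }
    }

    private static final Class<?> WRAPPER;
    private static final MethodHandle UNWRAP;

    static {
        try {
            // Resolve the wrapper by name, as one would for a class that
            // cannot be referenced directly from the calling code.
            WRAPPER = Class.forName(TaggedLikeSplit.class.getName());
            Method m = WRAPPER.getDeclaredMethod("getInputSplit");
            m.setAccessible(true);
            // Adapt the handle to (Split)Split so invokeExact can be used.
            UNWRAP = MethodHandles.lookup().unreflect(m)
                    .asType(MethodType.methodType(Split.class, Split.class));
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    // Calls the hidden accessor on a wrapper instance.
    private static Split unwrap(Split wrapper) {
        try {
            return (Split) UNWRAP.invokeExact(wrapper);
        } catch (Throwable e) {
            throw new AssertionError(e);
        }
    }

    // Peels off wrappers until a FileLikeSplit is found; empty otherwise.
    static Optional<FileLikeSplit> fileSplitOf(Split split) {
        if (split instanceof FileLikeSplit) {
            return Optional.of((FileLikeSplit) split);
        } else if (WRAPPER.isInstance(split)) {
            return fileSplitOf(unwrap(split));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Split wrapped = new TaggedLikeSplit(new FileLikeSplit("/data/input/a.txt"));
        // prints /data/input/a.txt
        System.out.println(fileSplitOf(wrapped).map(FileLikeSplit::getPath).orElse("<none>"));
    }
}
```

Resolving the handle once in a static initializer means the reflective lookup cost is paid a single time, and a missing class or method fails fast at class-load time rather than in the middle of a map task.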
    
