Question
Is there a way to have a whole file sent to a mapper without being split?
I have read this but I am wondering if there is another way of doing the same thing without having to generate an intermediate file. Ideally, I would like an existing option on the command line to Hadoop.
I am using the streaming facility with Python scripts on Amazon EMR.
Answer 1:
Just set the configuration property mapred.min.split.size to something huge, larger than any single input file (10 GiB here):
-D mapred.min.split.size=10737418240
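A minimal Python sketch of where that byte value comes from and where the option sits in a streaming invocation. The jar name, paths, and mapper script below are hypothetical placeholders, not taken from the question. The one firm point is that generic options such as -D have to come before the streaming-specific options.

```python
def min_split_size_bytes(gib):
    """Convert a size in GiB to bytes."""
    return gib * 1024 ** 3


def build_streaming_command(jar, input_path, output_path, mapper, gib=10):
    """Build a Hadoop streaming argument list.

    All paths and the jar name are placeholders supplied by the caller.
    """
    return [
        "hadoop", "jar", jar,
        # Generic options (-D) must precede the streaming options.
        "-D", "mapred.min.split.size=%d" % min_split_size_bytes(gib),
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmd = build_streaming_command(
        "hadoop-streaming.jar", "input_dir", "output_dir", "mapper.py"
    )
    print(" ".join(cmd))
```

A file larger than the configured minimum would still be split, so the value should be chosen above the size of the biggest input file.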
Or compress the input file using a codec that isn't splittable (Gzip). With the .gz extension, TextInputFormat will return false from its isSplitable(FileSystem, Path) method, so the whole file goes to a single mapper.
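If the input is not compressed yet, something like the following could produce the .gz copy before uploading it. This is only a sketch using Python's standard gzip module; the function name gzip_file is made up for illustration.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip and return the path of the .gz file."""
    if dst_path is None:
        # Keep the .gz extension: the codec is picked from the file name.
        dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream the contents so large files are not loaded into memory.
        shutil.copyfileobj(src, dst)
    return dst_path
```

The trade-off is that a gzipped file is decompressed by one mapper from start to finish, which is exactly the non-split behaviour asked for here.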
Source: https://stackoverflow.com/questions/10969517/hadoop-non-splittable-textinputformat