How to specify tab as a record separator for hadoop input text file?

守給你的承諾、 submitted on 2019-12-25 05:06:10

Question


The input to my Hadoop M/R job is a text file whose records are separated by the tab character '\t' instead of the newline '\n'. By default Hadoop splits on newlines and treats each line of the text file as a record. How can I instruct it to split on the tab character instead?

One way to do it is to use a custom input format class that wraps the original stream in a filter stream converting every tab to a newline. But this does not look elegant.
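For illustration, the filter-stream idea could be sketched in plain Java as below. The class name TabToNewlineInputStream and the convert helper are hypothetical, and wiring the stream into an actual input format is not shown. Replacing one byte with one byte keeps offsets unchanged, but any newline that occurs inside a record would then be treated as a record boundary too.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical filter stream that rewrites every tab byte to a newline byte. */
public class TabToNewlineInputStream extends FilterInputStream {

    public TabToNewlineInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return b == '\t' ? '\n' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\t') {
                buf[i] = '\n';
            }
        }
        return n;
    }

    /** Demo helper (an assumption for this sketch): pushes a string through the filter. */
    static String convert(String text) throws IOException {
        InputStream in = new TabToNewlineInputStream(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4];
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The byte-level replacement is safe for UTF-8 input because the tab byte 0x09 never appears inside a multi-byte sequence.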

Another way would be to use java.util.Scanner with tab as the separator. But I cannot figure out how to use the java.util.Scanner class in the input format classes.
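Outside Hadoop, reading tab-separated records with java.util.Scanner might look like the sketch below. The class TabScannerDemo and its readRecords method are hypothetical names. One likely reason this is awkward inside a record reader is that Scanner buffers ahead and does not report byte positions, which makes it hard to honor input split boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical demo: one record per tab-delimited token. */
public class TabScannerDemo {

    static List<String> readRecords(InputStream in) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            // The delimiter is a regular expression; "\t" matches a single tab.
            scanner.useDelimiter("\t");
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    /** Convenience wrapper for in-memory text, used only for demonstration. */
    static List<String> readRecords(String text) {
        return readRecords(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Newlines are not delimiters here, so a record such as "b\nc" comes back as a single token.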

What is the best approach, and what are the alternatives?


Answer 1:


The values '\r' and '\n' are hard-coded in the org.apache.hadoop.util.LineReader class, so you can't use TextInputFormat with tab-separated records. But it is not difficult to implement your own InputFormat with a special LineReader class. The simplest solution is to copy the TextInputFormat, LineRecordReader and LineReader classes into your own package and change the LineReader implementation.
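The kind of change the copied LineReader would need can be sketched in plain Java, independent of Hadoop. DelimitedRecordReader and its methods are hypothetical names, not Hadoop classes, and a real reader would also have to buffer its input and track byte positions so that it respects split boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified LineReader: splits on one delimiter byte. */
public class DelimitedRecordReader {
    private final InputStream in;
    private final int delimiter;

    public DelimitedRecordReader(InputStream in, char delimiter) {
        this.in = in;
        this.delimiter = delimiter;
    }

    /** Returns the next record without its delimiter, or null at end of stream. */
    public String readRecord() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        // Unbuffered for clarity; wrap the stream in a BufferedInputStream in practice.
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == delimiter) {
                return buf.toString("UTF-8");
            }
            buf.write(b);
        }
        // A final record without a trailing delimiter is still returned.
        return sawAny ? buf.toString("UTF-8") : null;
    }

    /** Demo helper (an assumption for this sketch): splits in-memory text into records. */
    static List<String> splitAll(String text, char delimiter) throws IOException {
        DelimitedRecordReader reader = new DelimitedRecordReader(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), delimiter);
        List<String> records = new ArrayList<>();
        String record;
        while ((record = reader.readRecord()) != null) {
            records.add(record);
        }
        return records;
    }
}
```

As far as I know, later Hadoop releases also read a custom delimiter for TextInputFormat from the textinputformat.record.delimiter configuration property, for example conf.set("textinputformat.record.delimiter", "\t"). Whether your version supports it is worth checking before copying any classes.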



Source: https://stackoverflow.com/questions/7271641/how-to-specify-tab-as-a-record-separator-for-hadoop-input-text-file
