Mapreduce XML input format - to build custom format

Posted by 佐手、 on 2019-12-11 06:08:49

Question


If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes that each line of the input file is one record, and the Mapper class is called once per line to get a key-value pair for that record.

So I think we need a custom input format to scan the XML datasets.

I am new to Hadoop MapReduce. Is there an article, link, or video that shows the steps to build a custom input format?

thanks nath


Answer 1:


Problem: Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format like XML that is not inherently splittable?

Solution: MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.

So there is no need to write a custom input format, since the Mahout library already provides one. I am not sure whether you are going to read or write XML, but both are described in the link above.

Please have a look at the XmlInputFormat implementation details here.

Furthermore, XmlInputFormat extends TextInputFormat.
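To make the splitting idea concrete, here is a minimal stand-alone sketch of the tag-scanning approach that an XML input format can use. It is not the Mahout code and it does not use Hadoop at all. The class name XmlSplitScanner and its methods are hypothetical, made up for illustration. The assumptions are that a record is everything from a literal start tag to the matching literal end tag, that records of the same name are not nested, and that the start tag carries no attributes. A split owns every record whose start tag begins inside it, and it keeps reading past its own end to finish the last record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Mahout's XmlInputFormat.
class XmlSplitScanner {

    // Index of the first occurrence of pattern in data at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Returns the records whose start tag begins in [splitStart, splitEnd).
    // A record may run past splitEnd; the next split skips it because its
    // start tag does not begin inside that split.
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int s = indexOf(data, start, pos);
            if (s < 0 || s >= splitEnd) {
                break; // no further record starts inside this split
            }
            int e = indexOf(data, end, s + start.length);
            if (e < 0) {
                break; // unterminated record: stop rather than guess
            }
            int recordEnd = e + end.length;
            records.add(new String(data, s, recordEnd - s, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<root><rec>a</rec><rec>b</rec><rec>c</rec></root>"
                .getBytes(StandardCharsets.UTF_8);
        int mid = data.length / 2; // pretend the file was cut into two splits here
        System.out.println(readSplit(data, 0, mid, "<rec>", "</rec>"));
        System.out.println(readSplit(data, mid, data.length, "<rec>", "</rec>"));
    }
}
```

In this example the record <rec>b</rec> straddles the midpoint, so it is emitted by the first split only, and the second split starts at the next start tag it finds. A real input format would do the same scan on a file stream inside a RecordReader instead of on an in-memory byte array.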



Source: https://stackoverflow.com/questions/37848347/mapreduce-xml-input-format-to-build-custom-format
