Custom InputFormat for reading JSON in Hadoop


If your question is in line with what Magham Ravi commented, that answer is fine.

But if you have a single file with all of the JSON data, as you mentioned above, you might want to read the whole file as one record, retrieve it as a String from the value part (the BytesWritable value) in the map() function, and feed it to a JSON parser inside that same map() function.

Please have a look at WholeFileInputFormat
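Roughly, the map() side would look like the sketch below. It assumes a WholeFileInputFormat that hands the mapper <NullWritable, BytesWritable> pairs (as in the well-known whole-file example) and uses Jackson as the JSON parser; the class name and output key are just illustrative.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only: expects an input format that emits the whole file as a single
// BytesWritable value (e.g. a WholeFileInputFormat you provide).
public class WholeFileJsonMapper
        extends Mapper<NullWritable, BytesWritable, Text, Text> {

    private final ObjectMapper jsonMapper = new ObjectMapper();

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The value holds the raw bytes of the entire file; turn them into a String.
        String json = new String(value.getBytes(), 0, value.getLength(), "UTF-8");

        // Parse the whole document and emit whatever you need from it.
        JsonNode root = jsonMapper.readTree(json);
        context.write(new Text("record"), new Text(root.toString()));
    }
}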

Furthermore, if you have multiple JSON objects in a single file and want each JSON object handed to the mapper as a separate value, you can use something like XmlInputFormat with start and end tags defined. In your case, for JSON, you must have unique start and end tags that exactly mark the start and end of the single JSON object you want. Merely using start-tag = "[{" and end-tag = "}]" might not help if you want the whole JSON object above returned as one value, because many of those characters are already nested inside it and would confuse the InputFormat.
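For the mechanics, here is a driver sketch. It assumes a copy of Mahout's XmlInputFormat vendored into your project (adjust the package to wherever you put it) and that the copy reads the xmlinput.start / xmlinput.end properties, which you should verify against your copy; the identity Mapper is used only to keep the sketch short. As said above, this only works if your start and end markers are genuinely unique within a record.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Assumed: XmlInputFormat copied from Mahout into your own package.
import com.example.hadoop.XmlInputFormat;

public class TaggedJsonDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Markers that open and close one record; with JSON these must be
        // unique, otherwise nested "[{" / "}]" will confuse the reader.
        conf.set("xmlinput.start", "[{");
        conf.set("xmlinput.end", "}]");

        Job job = Job.getInstance(conf, "json between tags");
        job.setJarByClass(TaggedJsonDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(Mapper.class);   // identity mapper, for the sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}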

If you are not able to achieve any of the above, try building your own custom TextInputFormat, overriding the LineReader used by TextInputFormat.

In the LineReader class, you'll find these two constants set (I may be a little outdated here; please check whether the delimiter is configurable now via a configuration property. I know that CDH has made it configurable; if not, you need to override it):

private static final byte CR = '\r';
private static final byte LF = '\n';

You can let go of CR and change LF to point to "]\n[", since each of your independent JSON objects would be laid out in the form shown below (or you know your own layout better); a sketch of doing this with a configurable record delimiter follows the example:

[
...JSON 1
]
[
...JSON 2
]
[
...JSON N
]

(NOTE: there is a \n between ] and [ that marks the boundary between different JSON objects' data.)
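If your Hadoop version does expose the delimiter as a configuration property (in newer releases it is textinputformat.record.delimiter; please verify for your distribution), you can skip patching LineReader altogether. A minimal mapper sketch under that assumption: the driver sets the delimiter to "]\n[", and the mapper puts back the brackets that the split strips off before handing the record to Jackson.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DelimitedJsonMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In the driver (assumed to be supported by your Hadoop/CDH version):
    //   conf.set("textinputformat.record.delimiter", "]\n[");

    private final ObjectMapper jsonMapper = new ObjectMapper();

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Splitting on "]\n[" leaves a leading "[" only on the first record and
        // a trailing "]" only on the last one, so normalise before parsing.
        String record = value.toString().trim();
        if (!record.startsWith("[")) {
            record = "[" + record;
        }
        if (!record.endsWith("]")) {
            record = record + "]";
        }
        JsonNode json = jsonMapper.readTree(record);
        context.write(new Text("record"), new Text(json.toString()));
    }
}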

Hope this makes sense.
