How to parse CustomWritable from text in Hadoop

Submitted by 限于喜欢 on 2019-12-23 11:56:48

Question


Say I have timestamped values for specific users in text files, like

#userid; unix-timestamp; value
1; 2010-01-01 00:00:00; 10
2; 2010-01-01 00:00:00; 20
1; 2010-01-01 01:00:00; 11
2; 2010-01-01 01:00:00; 21
1; 2010-01-02 00:00:00; 12
2; 2010-01-02 00:00:00; 22

I have a custom class "SessionSummary" that implements readFields and write from WritableComparable. Its purpose is to sum up all values per user for each calendar day.

So the mapper maps each line to its user, and the reducer sums all values per user per day and writes each SessionSummary through TextOutputFormat (using SessionSummary's toString, as tab-separated UTF-8 strings):

1; 2010-01-01; 21
2; 2010-01-01; 41
1; 2010-01-02; 12
2; 2010-01-02; 22

If I need to use these summary entries in a second Map/Reduce stage, how should I parse the summary data to populate the members? Can I somehow reuse the existing readFields and write methods (from the WritableComparable implementation) by passing the text String in as a DataInput? This (obviously) did not work:

public void map(...) {
    SessionSummary ssw = new SessionSummary();
    ssw.readFields(new DataInputStream(new ByteArrayInputStream(value.getBytes("UTF-8"))));
}

In general: Is there a best practice to implement custom keys and values in Hadoop and make them easily reusable across several M/R stages, while keeping human-readable text output at every stage?

(Hadoop version is 0.20.2 / CDH3u3)


Answer 1:


The output format for your first MR job should be SequenceFileOutputFormat. It stores the key/value pairs emitted by the reducer in a binary format, which your second MR job can then read back in using SequenceFileInputFormat. Also make sure you call setOutputKeyClass and setOutputValueClass on the Job accordingly.

The mapper in the second job then receives SessionSummary (and whatever the value type is) as its input types, already deserialized, so there is no text to parse.

If you need to see the textual output from the first MR job, you can run the following on the output files in HDFS:

hadoop fs -libjars my-lib.jar -text output-dir/part-r-*

This will read in the sequence file key/value pairs and call toString() on both objects, tab-separating them when writing to stdout. The -libjars option tells hadoop where to find your custom key/value classes.
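To illustrate why the binary route works where feeding text into readFields does not, here is a minimal sketch that uses only java.io. The SessionSummary below is a hypothetical stand-in: its fields (userId, day, sum) and their order are assumptions, and it does not implement Hadoop's WritableComparable so that it can run without Hadoop on the classpath. The point is only that readFields expects exactly the byte layout that write produced, which is roughly what a sequence file stores, while toString is a display form that readFields cannot consume.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SessionSummaryRoundTrip {

    // Hypothetical stand-in for the asker's class; the field layout is assumed.
    // In a real job this would implement org.apache.hadoop.io.WritableComparable.
    static class SessionSummary {
        int userId;
        String day = "";
        long sum;

        // Binary form: this is the layout readFields() must mirror exactly.
        void write(DataOutput out) throws IOException {
            out.writeInt(userId);
            out.writeUTF(day);
            out.writeLong(sum);
        }

        // Reads the fields back in the same order and with the same types.
        void readFields(DataInput in) throws IOException {
            userId = in.readInt();
            day = in.readUTF();
            sum = in.readLong();
        }

        // Display form only; its bytes do not match the write() layout.
        @Override
        public String toString() {
            return userId + "\t" + day + "\t" + sum;
        }
    }

    // Serializes with write() and restores a fresh object with readFields().
    static SessionSummary roundTrip(SessionSummary original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            SessionSummary copy = new SessionSummary();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            // Not expected for in-memory streams.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SessionSummary summary = new SessionSummary();
        summary.userId = 1;
        summary.day = "2010-01-01";
        summary.sum = 21;

        // The copy is rebuilt from the binary form, then shown via toString().
        System.out.println(roundTrip(summary));
    }
}
```

The attempt in the question passes the UTF-8 bytes of the toString form to readFields, so readInt, readUTF and readLong would interpret text characters as binary fields. Keeping the intermediate data in the write() layout and using toString only for inspection avoids that mismatch.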



Source: https://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
