I'm new to Hadoop, and I'm trying to write a MapReduce program that counts the two most frequent letters by date (grouped by month). My input looks like this:
I think you're trying to do too much work in the Mapper. All it needs to do is group the lines by month (and, judging by your expected output, you aren't formatting the dates correctly anyway).
For example, the following approach turns these lines
2017-07-01 , A, B, A, C, B, E, F
2017-07-05 , A, B, A, G, B, G, G
into this key/value pair for the reducer:
2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
In other words, you gain no real benefit from using an ArrayWritable; just keep the values as Text.
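If it helps to see the grouping without a cluster, here is a minimal plain-Java sketch of what the map and shuffle steps do to the two example lines. The class and method names (`MonthGroupingSketch`, `monthKey`, `groupByMonth`) are made up for illustration and are not part of Hadoop, and it assumes every date is already a well-formed `yyyy-MM-dd` string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the map + shuffle phases; no Hadoop classes involved.
class MonthGroupingSketch {
    // Truncate "yyyy-MM-dd" to "yyyy-MM" (assumes a well-formed date).
    static String monthKey(String date) {
        return date.trim().substring(0, 7);
    }

    // Collect each line's letter list under its month key.
    static Map<String, List<String>> groupByMonth(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // no letters on this line, skip it
            }
            String key = monthKey(line.substring(0, comma));
            // Strip all whitespace so " A, B" becomes "A,B"
            String tokens = line.substring(comma + 1).replaceAll("\\s", "");
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(tokens);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2017-07-01 , A, B, A, C, B, E, F",
                "2017-07-05 , A, B, A, G, B, G, G");
        // Prints {2017-07=[A,B,A,C,B,E,F, A,B,A,G,B,G,G]}
        System.out.println(groupByMonth(lines));
    }
}
```

In the real job the framework does the grouping for you; the Mapper only has to emit the month as the key and the letter string as the value.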
So, the Mapper would look like this
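import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
    private final Text key = new Text();
    private final Text output = new Text();
    private final SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
    private final SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");

    @Override
    protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
        final String valueStr = value.toString();
        // Search the String, not the Text: Text.find returns a byte offset
        int separatorIndex = valueStr.indexOf(',');
        if (separatorIndex < 0) {
            System.err.printf("mapper: not enough records for %s%n", valueStr);
            return;
        }
        // Everything before the first comma is the date, the rest are the letters
        String dateKey = valueStr.substring(0, separatorIndex).trim();
        String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
        try {
            // Reformat yyyy-MM-dd as yyyy-MM so all days of a month share one key
            dateKey = fmtTo.format(fmtFrom.parse(dateKey));
            key.set(dateKey);
        } catch (ParseException ex) {
            System.err.printf("mapper: invalid key format %s%n", dateKey);
            return;
        }
        output.set(tokens);
        context.write(key, output);
    }
}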
The reducer can then build a Map that counts the letters across the value strings, again writing out only Text.
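import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class CustomReduce extends Reducer<Text, Text, Text, Text> {
    private final Text output = new Text();

    @Override
    protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // TreeMap keeps the letters in alphabetical order for the output
        Map<String, Integer> keyMap = new TreeMap<>();
        for (Text v : values) {
            String[] keys = v.toString().trim().split(",");
            for (String key : keys) {
                if (key.isEmpty()) {
                    continue; // ignore empty tokens, e.g. from a blank letter list
                }
                if (!keyMap.containsKey(key)) {
                    keyMap.put(key, 0);
                }
                keyMap.put(key, 1 + keyMap.get(key));
            }
        }
        output.set(mapToString(keyMap));
        context.write(date, output);
    }

    // Render the counts as "A:4, B:4, ..."
    private String mapToString(Map<String, Integer> map) {
        StringBuilder sb = new StringBuilder();
        String delimiter = ", ";
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            sb.append(
                String.format("%s:%d", entry.getKey(), entry.getValue())
            ).append(delimiter);
        }
        // Drop the trailing delimiter, unless the map was empty
        if (sb.length() > 0) {
            sb.setLength(sb.length() - delimiter.length());
        }
        return sb.toString();
    }
}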
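The counting logic is easy to check on its own, outside Hadoop. Below is a minimal sketch that feeds the two 2017-07 value strings from the example through the same kind of count-and-format step, written with `Map.merge` and `Collectors.joining` instead of the manual `containsKey` check and trailing-delimiter trim. `LetterCountSketch` and its method names are hypothetical, and plain `String` stands in for `Text`.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the reducer body; String replaces Hadoop's Text.
class LetterCountSketch {
    // Count every letter across all comma-separated value strings.
    static Map<String, Integer> countLetters(Iterable<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String letter : v.trim().split(",")) {
                if (!letter.isEmpty()) {
                    counts.merge(letter, 1, Integer::sum); // insert 1 or add 1
                }
            }
        }
        return counts;
    }

    // Render the counts as "A:4, B:4, ..." with no trailing delimiter.
    static String format(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countLetters(Arrays.asList("A,B,A,C,B,E,F", "A,B,A,G,B,G,G"));
        // Prints A:4, B:4, C:1, E:1, F:1, G:3
        System.out.println(format(counts));
    }
}
```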
Given your input, I got this
2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
2017-07 A:4, B:4, C:1, E:1, F:1, G:3
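That output lists every letter, while the question asks for only the two most frequent per month. One way to get there is to sort the counted entries before formatting them. The sketch below is an assumption about what you want rather than a tested part of the job: `TopTwoSketch` and `topN` are hypothetical names, and breaking ties alphabetically is my own choice (2017-06 has three letters tied at 4, so some rule is needed).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: pick the n most frequent letters from a count map.
class TopTwoSketch {
    static String topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically (an assumption)
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Counts copied from the 2017-06 output line
        Map<String, Integer> june = new TreeMap<>();
        june.put("A", 4); june.put("B", 4); june.put("C", 1);
        june.put("E", 4); june.put("F", 3); june.put("K", 1);
        june.put("Q", 2); june.put("R", 1); june.put("T", 1);
        // Prints A:4, B:4
        System.out.println(topN(june, 2));
    }
}
```

In the reducer this would replace the `mapToString(keyMap)` call, so each month is written with just its top two letters.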