Java MapReduce counting by date

萌比男神i 2021-01-28 09:16

I'm new to Hadoop, and I'm trying to write a MapReduce program that counts the two most frequent letters by date (grouped by month). My input is of this kind:

2 answers
  •  萌比男神i
    2021-01-28 10:15

    I think you're trying to do too much work in the Mapper. You only need to group the dates (which, judging by your expected output, you aren't formatting correctly anyway).

    For example, the following approach turns these lines

    2017-07-01 , A, B, A, C, B, E, F
    2017-07-05 , A, B, A, G, B, G, G
    

    into this pair for the reducer

    2017-07 , ("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")
    

    In other words, you gain no real benefit from using an ArrayWritable; just keep it as Text.
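
    To try the grouping step without a cluster, here is a minimal plain-Java sketch of the same line-to-pair transformation. The class name MonthKeySketch and the method toPair are hypothetical, and it swaps in java.time (LocalDate, YearMonth) for date handling, so it is an illustration of the idea rather than the Hadoop code itself.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper class, not part of Hadoop.
class MonthKeySketch {

    // Splits "2017-07-01 , A, B, A" into the pair ("2017-07", "A,B,A").
    // Returns null when the line contains no comma.
    static Map.Entry<String, String> toPair(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            return null;
        }
        String date = line.substring(0, comma).trim();
        // Remove all whitespace so the tokens become "A,B,A"
        String tokens = line.substring(comma + 1).replaceAll("\\s", "");
        // LocalDate.parse expects ISO yyyy-MM-dd; YearMonth prints as yyyy-MM
        String month = YearMonth.from(LocalDate.parse(date)).toString();
        return new SimpleEntry<>(month, tokens);
    }

    public static void main(String[] args) {
        System.out.println(toPair("2017-07-01 , A, B, A, C, B, E, F"));
        System.out.println(toPair("2017-07-05 , A, B, A, G, B, G, G"));
    }
}
```

    Both sample lines end up under the same key, 2017-07, which is what lets the framework hand them to a single reduce call.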


    So, the Mapper would look like this

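    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    class CustomMap extends Mapper<LongWritable, Text, Text, Text> {
    
        private final Text key = new Text();
        private final Text output = new Text();
    
        @Override
        protected void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
    
            final String valueStr = value.toString();
            // Search the String rather than the Text: Text.find returns a byte offset,
            // which only matches the char index for single-byte characters
            int separatorIndex = valueStr.indexOf(',');
            if (separatorIndex < 0) {
                System.err.printf("mapper: not enough records for %s%n", valueStr);
                return;
            }
            // Everything before the first comma is the date, the rest are the tokens
            String dateKey = valueStr.substring(0, separatorIndex).trim();
            String tokens = valueStr.substring(1 + separatorIndex).trim().replaceAll("\\p{Space}", "");
    
            SimpleDateFormat fmtFrom = new SimpleDateFormat("yyyy-MM-dd");
            SimpleDateFormat fmtTo = new SimpleDateFormat("yyyy-MM");
    
            try {
                // Reduce the full date to its month so that all days of a month share one key
                dateKey = fmtTo.format(fmtFrom.parse(dateKey));
                key.set(dateKey);
            } catch (ParseException ex) {
                System.err.printf("mapper: invalid key format %s%n", dateKey);
                return;
            }
    
            output.set(tokens);
            context.write(key, output);
        }
    }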
    

    The reducer can then build a Map that counts the tokens in the value strings, again writing out only Text.

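    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    class CustomReduce extends Reducer<Text, Text, Text, Text> {
    
        private final Text output = new Text();
    
        @Override
        protected void reduce(Text date, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    
            // TreeMap keeps the tokens in alphabetical order for the output
            Map<String, Integer> keyMap = new TreeMap<>();
            for (Text v : values) {
                String[] keys = v.toString().trim().split(",");
    
                for (String key : keys) {
                    if (!keyMap.containsKey(key)) {
                        keyMap.put(key, 0);
                    }
                    keyMap.put(key, 1 + keyMap.get(key));
                }
            }
    
            output.set(mapToString(keyMap));
            context.write(date, output);
        }
    
        // Formats the counts as "A:4, B:4, ..."
        private String mapToString(Map<String, Integer> map) {
            StringBuilder sb = new StringBuilder();
            String delimiter = ", ";
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                sb.append(
                        String.format("%s:%d", entry.getKey(), entry.getValue())
                ).append(delimiter);
            }
            // Drop the trailing delimiter, if anything was appended
            if (sb.length() >= delimiter.length()) {
                sb.setLength(sb.length() - delimiter.length());
            }
            return sb.toString();
        }
    }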
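
    The counting loop can also be checked on its own, outside Hadoop. The following is a minimal sketch under that assumption: CountSketch and count are hypothetical names, plain Strings stand in for the Text values, and Map.merge replaces the containsKey/put pair.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the reducer's counting logic.
class CountSketch {

    // Counts every comma-separated token across all value strings of one key.
    static Map<String, Integer> count(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            for (String token : v.trim().split(",")) {
                if (!token.isEmpty()) {
                    // Insert 1 for a new token, otherwise add 1 to the stored count
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two 2017-07 value strings from the example pair
        System.out.println(count(Arrays.asList("A,B,A,C,B,E,F", "A,B,A,G,B,G,G")));
        // prints {A=4, B=4, C=1, E=1, F=1, G=3}
    }
}
```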
    

    Given your input, I got this

    2017-06 A:4, B:4, C:1, E:4, F:3, K:1, Q:2, R:1, T:1
    2017-07 A:4, B:4, C:1, E:1, F:1, G:3
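
    The question asks for the two most frequent letters per month, while this output lists every count. One possible way to narrow it down, sketched here in plain Java, is to sort the counted entries before formatting them. The names TopTwoSketch and topN are hypothetical, and breaking ties alphabetically is an assumption, since the question does not say how ties should be handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that could be applied to the reducer's count map.
class TopTwoSketch {

    // Keeps the n entries with the highest counts.
    // Assumption: ties are broken alphabetically by key.
    static Map<String, Integer> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap preserves the sorted order
        Map<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        // Counts taken from the 2017-07 output line
        Map<String, Integer> july = new TreeMap<>();
        july.put("A", 4); july.put("B", 4); july.put("C", 1);
        july.put("E", 1); july.put("F", 1); july.put("G", 3);
        System.out.println(topN(july, 2)); // prints {A=4, B=4}
    }
}
```

    For 2017-06, where A, B and E all have a count of 4, this tie rule would keep A and B and drop E, so a different rule may be needed depending on the expected output.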
    
