Iterate twice on values (MapReduce)

轮回少年 2020-11-29 07:22

I receive an iterator as argument and I would like to iterate on values twice.

public void reduce(Pair key, Iterator          


        
11 answers
  •  执笔经年
    2020-11-29 08:14

    Unfortunately this is not possible without caching the values as in Andreas_D's answer.

    Even using the new API, where the Reducer receives an Iterable rather than an Iterator, you cannot iterate twice. It's very tempting to try something like:

    for (IntWritable value : values) {
        // first loop
    }
    
    for (IntWritable value : values) {
        // second loop
    }
    

    But this won't actually work. The Iterator returned by that Iterable's iterator() method is single-pass, not an ordinary collection iterator. The values may not all be in memory; Hadoop may be streaming them from disk. Because they aren't backed by a Collection, supporting multiple iterations is nontrivial.
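
    To make the behaviour concrete, here is a minimal plain-Java sketch with no Hadoop dependency. SinglePassIterable and countPass are hypothetical names invented for this illustration; the class only mimics an Iterable whose iterator() keeps returning the same underlying stream, which is an assumption about how the real one behaves, not a copy of Hadoop's code.

```java
import java.util.Arrays;
import java.util.Iterator;

final class SinglePassDemo {
    // Hypothetical stand-in for the Iterable a Reducer receives: every call to
    // iterator() hands back the same underlying stream, so it is consumed once.
    static final class SinglePassIterable<T> implements Iterable<T> {
        private final Iterator<T> stream;

        SinglePassIterable(Iterator<T> stream) {
            this.stream = stream;
        }

        @Override
        public Iterator<T> iterator() {
            return stream; // no rewind: a second caller sees only what is left
        }
    }

    // Counts how many elements one full for-each pass yields.
    static <T> int countPass(Iterable<T> values) {
        int n = 0;
        for (T ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
                new SinglePassIterable<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println("first pass:  " + countPass(values)); // 3
        System.out.println("second pass: " + countPass(values)); // 0
    }
}
```

    In this sketch the second loop compiles and runs without error but sees no elements, which is why the two-loop pattern is so easy to get wrong.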

    You can see this for yourself in the Reducer and ReduceContext code.

    Caching the values in a Collection of some sort may be the easiest answer, but you can easily blow the heap if you are operating on large datasets. If you can give us more specifics on your problem, we may be able to help you find a solution that doesn't involve multiple iterations.
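
    A rough sketch of the caching approach, again in plain Java so it runs without Hadoop. Box, reusingIterator and cache are hypothetical names made up for this example. The sketch assumes the source refills one shared object on every next() call, as Hadoop is known to do with its Writable values, so it copies the payload instead of storing the reference. The whole group ends up in memory, so the heap caveat above still applies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CachingDemo {
    // Hypothetical mutable holder, standing in for a reused value object.
    static final class Box {
        int value;
    }

    // Hypothetical single-pass source that refills ONE shared Box per next().
    static Iterator<Box> reusingIterator(int... data) {
        final Box shared = new Box();
        return new Iterator<Box>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < data.length;
            }

            @Override
            public Box next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[i++];
                return shared;
            }
        };
    }

    // Drains the iterator once, copying each payload so later refills of the
    // shared object cannot change what was cached.
    static List<Integer> cache(Iterator<Box> values) {
        List<Integer> cached = new ArrayList<>();
        while (values.hasNext()) {
            cached.add(values.next().value); // copy the value, not the reference
        }
        return cached;
    }

    public static void main(String[] args) {
        List<Integer> cached = cache(reusingIterator(4, 8, 15));

        int sum = 0;
        for (int v : cached) {          // first pass over the cached copy
            sum += v;
        }
        int max = Integer.MIN_VALUE;
        for (int v : cached) {          // second pass over the same copy
            max = Math.max(max, v);
        }
        System.out.println(sum + " " + max); // prints "27 15"
    }
}
```

    Storing the Box references themselves would leave the list holding the same object three times, all showing the last value, which is the usual pitfall when caching from an iterator that reuses its objects.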
