Manipulating an iterator in MapReduce

死守一世寂寞 2020-12-15 01:43

I am trying to find the sum of any given points using Hadoop. The issue I am having is getting all of the values for a given key into a single reducer.

4 Answers
  •  不知归路
    2020-12-15 02:31

    Going by your previous question, you appear to be stuck on the iterator problem piccolbo described. The formulation of your reducer also indicates you've forgone his proposed algorithms in favor of the naive approach, which will work, albeit suboptimally.

    Allow me to clean up your code a bit with my answer:

    // Making use of Hadoop's Iterable reduce, assuming it's available to you
    //
    //  The method signature is:
    //
    //  protected void reduce(KEYIN key, java.lang.Iterable<VALUEIN> values,
    //   org.apache.hadoop.mapreduce.Reducer.Context
    //   context) throws java.io.IOException, java.lang.InterruptedException
    //
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
    
        // I assume you declare this here to save on GC
        Text outKey = new Text();
        IntWritable outVal = new IntWritable();
    
        // Since you've forgone piccolbo's approach, you'll need to maintain the
        // data structure yourself. Since we always walk the list forward and
        // wish to optimize the insertion speed, we use LinkedList. Calls to
        // IntWritable.get() will give us an int, which we then copy into our list.
        LinkedList<Integer> valueList = new LinkedList<Integer>();
    
        // Here's why we changed the method signature: use of Java's for-each
        for (IntWritable iw: values) {
            valueList.add(iw.get());
        }
    
        // And from here, we construct each value pair as an O(n^2) operation
        for (Integer i: valueList) {
            for (Integer j: valueList) {
                outKey.set(i + " + " + j);
                outVal.set(i + j);
                context.write(outKey, outVal);
            }
        }
    
        // Do note: I've also changed your output value type from DoubleWritable to
        // IntWritable, since you should always be performing integer operations
        // as defined. If your points are doubles, supply DoubleWritable instead.
    }
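
    To make the O(n^2) pairing concrete: if a key arrives with the values 1, 2 and 3, the reducer above emits the following key/value pairs (worked out by hand, assuming the default tab-separated text output):

        1 + 1    2
        1 + 2    3
        1 + 3    4
        2 + 1    3
        2 + 2    4
        2 + 3    5
        3 + 1    4
        3 + 2    5
        3 + 3    6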
    

    This works, but it makes several assumptions that limit performance when constructing your distance matrix, including requiring the combination to be performed in a single reduce operation.
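
    For completeness, forcing the whole combination into a single reduce call means the mapper has to emit every point under one constant key. Below is a minimal sketch of such a mapper; it is not code from the original post, the class name PointMapper is a placeholder, and it assumes each input line holds a single integer point:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch only: every point is written under the same key, so the reducer
    // above receives all of them in a single reduce() call.
    public class PointMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final Text ALL_POINTS = new Text("all"); // one constant key
        private final IntWritable point = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes one integer point per line; adjust the parsing to your format.
            point.set(Integer.parseInt(line.toString().trim()));
            context.write(ALL_POINTS, point);
        }
    }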

    Consider piccolbo's approach if you know the size and dimensionality of your input data set in advance. This should be available, in the worst case, by walking the lines of input in linear time.
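
    As a rough illustration of that linear pass (the input path and one-point-per-line layout are assumptions on my part, not from your post), you could count the points before submitting the job:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PointCounter {

        // Sketch: a single linear pass over the input to learn N (the number of
        // points) ahead of the MapReduce job. Assumes one point per line.
        public static long countPoints(Configuration conf, Path input) throws IOException {
            FileSystem fs = input.getFileSystem(conf);
            long n = 0;
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(input)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.trim().isEmpty()) {
                        n++;
                    }
                }
            }
            return n;
        }
    }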

    (See this thread for why we can't implement this as a forward iterator.)
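
    And if it helps to see the pieces together, here is an assumed driver; PairSumReducer stands for a Reducer<Text, IntWritable, Text, IntWritable> containing the reduce() method above, PointMapper is the sketch from earlier, and none of these class names come from the original post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Assumed wiring: a single mapper emitting one constant key, and a single
    // reducer performing the O(n^2) pairing shown above.
    public class PairSum {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "pair sums");
            job.setJarByClass(PairSum.class);
            job.setMapperClass(PointMapper.class);
            job.setReducerClass(PairSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }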
