Java 8: Extracting a pair of arrays out of a Stream<Pair>

Question

So I have some code using Java 8 streams, and it works. It does exactly what I need it to do, and it's legible (a rarity for functional programming). Towards the end of a subroutine, the code runs over a List of a custom pair type:

// All names Hungarian-Notation-ized for SO reading
class AFooAndABarWalkIntoABar
{
    public int      foo_int;
    public BarClass bar_object;
    ....
}

List<AFooAndABarWalkIntoABar> results = ....;

The data here must be passed into other parts of the program as arrays, so they get copied out:

// extract either a foo or a bar from each "foo-and-bar" (fab)
int[] foo_array = results.stream()
    .mapToInt (fab -> fab.foo_int)
    .toArray();

BarClass[] bar_array = results.stream()
    .map (fab -> fab.bar_object)
    .toArray(BarClass[]::new);

And done. Now each array can go do its thing.

Except... looping over the List twice bothers me in my soul. And if we ever need to track more information, someone will likely add a third field, and then a third pass will be needed to turn the 3-tuple into three arrays, and so on. So I'm experimenting with doing it in a single pass.

Allocating the data structures is trivial, but maintaining an index for use by the Consumer seems hideous:

int[] foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

// the trick is providing a stateful iterator across the array:
// - can't just use 'int', it's not effectively final
// - an actual 'final int' would be hilariously wrong
// - "all problems can be solved with a level of indirection"
class Indirection { int iterating = 0; }
final Indirection sigh = new Indirection();
// equivalent possibility is
//    final int[] disgusting = new int[]{ 0 };
// and then access disgusting[0] inside the lambda
// wash your hands after typing that code

results.stream().forEach (fab -> {
    foo_array[sigh.iterating] = fab.foo_int;
    bar_array[sigh.iterating] = fab.bar_object;
    sigh.iterating++;
});

This produces the same arrays as the existing solution that uses multiple stream passes, and it does so in about half the time, go figure. But the iterator indirection tricks seem unspeakably ugly, and of course they preclude any possibility of populating the arrays in parallel.

Using a pair of ArrayList instances, created with appropriate capacity, would let the Consumer code simply call add on each instance, with no external iterator needed. But ArrayList's toArray(T[]) has to copy the storage array again, and in the int case there's boxing/unboxing on top of that.

(edit: The answers to the "possible duplicate" question all talk about maintaining only the indices in a stream, and using direct array indexing to get to the actual data during filter/map calls, along with a note that it doesn't really work if the data isn't accessible by direct index. This question, by contrast, has a List that is "directly indexable" only from a viewpoint of "well, List#get exists, technically". If the results collection above is a LinkedList, for example, then calling an O(n) get N times with nonconsecutive indices would be... bad.)

Are there other, better, possibilities that I'm missing? I thought a custom Collector might do it, but I can't figure out how to maintain the state there either and never even got as far as scratch code.
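
For reference, a custom Collector can keep the index state inside its own accumulation container. The following is only a sketch under assumptions: BarClass and AFooAndABarWalkIntoABar are minimal stand-ins for the real types, and FooBarArrays and toFooBarArrays are hypothetical names that do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

class PairCollectorSketch {
    // Stand-ins for the question's types; the real classes have more members.
    static class BarClass {
        final String name;
        BarClass(String name) { this.name = name; }
    }

    static class AFooAndABarWalkIntoABar {
        int foo_int;
        BarClass bar_object;
        AFooAndABarWalkIntoABar(int foo, BarClass bar) { foo_int = foo; bar_object = bar; }
    }

    // Hypothetical mutable container: two growable buffers sharing one size counter.
    static class FooBarArrays {
        int[] foos = new int[16];
        BarClass[] bars = new BarClass[16];
        int size = 0;

        // Accumulator: append one element, doubling both buffers when full.
        void add(AFooAndABarWalkIntoABar fab) {
            if (size == foos.length) {
                foos = Arrays.copyOf(foos, size * 2);
                bars = Arrays.copyOf(bars, size * 2);
            }
            foos[size] = fab.foo_int;
            bars[size] = fab.bar_object;
            size++;
        }

        // Combiner: append the right-hand partial result after this one.
        FooBarArrays merge(FooBarArrays other) {
            int total = size + other.size;
            if (total > foos.length) {
                foos = Arrays.copyOf(foos, total);
                bars = Arrays.copyOf(bars, total);
            }
            System.arraycopy(other.foos, 0, foos, size, other.size);
            System.arraycopy(other.bars, 0, bars, size, other.size);
            size = total;
            return this;
        }

        // Finisher: shrink both buffers to the number of elements collected.
        FooBarArrays trim() {
            foos = Arrays.copyOf(foos, size);
            bars = Arrays.copyOf(bars, size);
            return this;
        }
    }

    static Collector<AFooAndABarWalkIntoABar, FooBarArrays, FooBarArrays> toFooBarArrays() {
        return Collector.of(FooBarArrays::new, FooBarArrays::add,
                            FooBarArrays::merge, FooBarArrays::trim);
    }

    public static void main(String[] args) {
        List<AFooAndABarWalkIntoABar> results = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            results.add(new AFooAndABarWalkIntoABar(i, new BarClass("bar" + i)));
        }
        FooBarArrays out = results.stream().collect(toFooBarArrays());
        System.out.println(Arrays.toString(out.foos) + " / " + out.bars.length + " bars");
    }
}
```

The trade-off is that a Collector does not know the stream size up front, so the buffers grow as needed and are trimmed at the end. That leaves one extra copy, which the pre-sized arrays in the loop above avoid.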

Answer 1:

Since the size of the stream is known, there is no reason to reinvent the wheel. The simplest solution is usually the best one. The second approach you have shown is nearly there: just use an AtomicInteger as the array index and you will achieve your goal, a single pass over the data with possible parallel stream execution (thanks to AtomicInteger).

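// requires: import java.util.concurrent.atomic.AtomicInteger;
AtomicInteger index = new AtomicInteger();
results.parallelStream().forEach(fab -> {
    // each element claims a unique slot; slots are not handed out in list order
    int idx = index.getAndIncrement();
    foo_array[idx] = fab.foo_int;
    bar_array[idx] = fab.bar_object;
});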

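This is thread safe for parallel execution and needs only one iteration over the whole collection. Note that with parallelStream() the elements are not written in list order; foo_array[i] and bar_array[i] still come from the same element.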
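
A self-contained sketch of this approach makes the ordering behaviour easy to check. It assumes minimal stand-ins for BarClass and AFooAndABarWalkIntoABar, and the Extracted holder and extract method are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicIndexSketch {
    // Stand-ins for the question's types; the real classes have more members.
    static class BarClass {
        final int id;
        BarClass(int id) { this.id = id; }
    }

    static class AFooAndABarWalkIntoABar {
        int foo_int;
        BarClass bar_object;
        AFooAndABarWalkIntoABar(int foo, BarClass bar) { foo_int = foo; bar_object = bar; }
    }

    // Hypothetical holder so both arrays can be returned together.
    static class Extracted {
        final int[] foo_array;
        final BarClass[] bar_array;
        Extracted(int[] foos, BarClass[] bars) { foo_array = foos; bar_array = bars; }
    }

    static Extracted extract(List<AFooAndABarWalkIntoABar> results) {
        int[] foo_array = new int[results.size()];
        BarClass[] bar_array = new BarClass[results.size()];
        AtomicInteger index = new AtomicInteger();
        results.parallelStream().forEach(fab -> {
            // Each element claims a unique slot, in whatever order threads arrive.
            int idx = index.getAndIncrement();
            foo_array[idx] = fab.foo_int;
            bar_array[idx] = fab.bar_object;
        });
        return new Extracted(foo_array, bar_array);
    }

    public static void main(String[] args) {
        List<AFooAndABarWalkIntoABar> results = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            results.add(new AFooAndABarWalkIntoABar(i, new BarClass(i)));
        }
        Extracted out = extract(results);

        boolean aligned = true;   // does slot i hold a foo and bar from the same element?
        boolean listOrder = true; // did the elements land in their original positions?
        for (int i = 0; i < out.foo_array.length; i++) {
            if (out.foo_array[i] != out.bar_array[i].id) aligned = false;
            if (out.foo_array[i] != i) listOrder = false;
        }
        System.out.println("pairs aligned: " + aligned);
        System.out.println("original list order kept: " + listOrder);
    }
}
```

The pairs stay aligned on every run, while the second flag can differ from run to run, so this variant only suits callers that do not depend on the original list order.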

Answer 2:

If your prerequisites are that both iterating the list and accessing the list via an index are expensive operations, there is no chance of getting a benefit from parallel Stream processing. You can try to go with this answer if you don’t need the result values in the original list order.

Otherwise, you can’t benefit from parallel Stream processing, as it requires the source to be able to split its contents efficiently into two halves, which implies either random access or fast iteration. If the source has no customized spliterator, the default implementation will try to enable parallel processing by buffering elements into an array. That already implies iterating before the parallel processing even starts, and it incurs additional array storage costs when your sole operation is an array store anyway.

Once you accept that there is no benefit from parallel processing, you can stay with your sequential solution, but solve the ugliness of the counter by moving it into the Consumer. Since lambda expressions cannot hold mutable state of their own, you can turn to the good old anonymous inner class:

// requires: import java.util.function.Consumer;
int[]      foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

results.forEach(new Consumer<AFooAndABarWalkIntoABar>() {
    int index = 0; // the counter lives in the Consumer instance, not the enclosing scope
    @Override
    public void accept(AFooAndABarWalkIntoABar t) {
        foo_array[index] = t.foo_int;
        bar_array[index] = t.bar_object;
        index++;
    }
});

Of course, there’s also the often-overlooked alternative of the good old for-loop:

int[]      foo_array = new int[results.size()];
BarClass[] bar_array = new BarClass[results.size()];

int index=0;
for(AFooAndABarWalkIntoABar t: results) {
    foo_array[index]=t.foo_int;
    bar_array[index]=t.bar_object;
    index++;
}

I wouldn’t be surprised if this beats all other alternatives performance-wise for your scenario…

Answer 3:

A way to reuse an index in a stream is to wrap your lambda in an IntStream that is in charge of incrementing the index:

IntStream.range(0, results.size()).forEach(i -> {
    foo_array[i] = results.get(i).foo_int;
    bar_array[i] = results.get(i).bar_object;
});

Compared with Antoniossss's answer, using an IntStream seems slightly preferable to using an AtomicInteger:

It also works with parallel();
It needs two fewer local variables;
It leaves the Stream API in charge of parallel processing;
It takes two fewer lines of code.

EDIT: as Mikhail Prokhorov pointed out, calling the get method twice on implementations such as LinkedList will be slower than other solutions, given the O(n) complexity of their implementations of get. The duplicate call can be avoided as follows, although each remaining get on such a list is still O(n):

AFooAndABarWalkIntoABar temp = results.get(i);
foo_array[i] = temp.foo_int;
bar_array[i] = temp.bar_object;
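
Putting the pieces of this answer together, a runnable sketch could look like the following. It assumes a random-access list such as ArrayList, uses minimal stand-ins for BarClass and AFooAndABarWalkIntoABar, and the fill method is a hypothetical name introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

class IndexedExtractSketch {
    // Stand-ins for the question's types; the real classes have more members.
    static class BarClass {
        final int id;
        BarClass(int id) { this.id = id; }
    }

    static class AFooAndABarWalkIntoABar {
        int foo_int;
        BarClass bar_object;
        AFooAndABarWalkIntoABar(int foo, BarClass bar) { foo_int = foo; bar_object = bar; }
    }

    // Fills both pre-sized arrays in one indexed pass over the list.
    static void fill(List<AFooAndABarWalkIntoABar> results, int[] foo_array, BarClass[] bar_array) {
        IntStream.range(0, results.size()).parallel().forEach(i -> {
            // One get per index; cheap for ArrayList, O(n) for LinkedList.
            AFooAndABarWalkIntoABar temp = results.get(i);
            foo_array[i] = temp.foo_int;
            bar_array[i] = temp.bar_object;
        });
    }

    public static void main(String[] args) {
        List<AFooAndABarWalkIntoABar> results = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            results.add(new AFooAndABarWalkIntoABar(i * 10, new BarClass(i)));
        }
        int[] foo_array = new int[results.size()];
        BarClass[] bar_array = new BarClass[results.size()];
        fill(results, foo_array, bar_array);
        System.out.println(Arrays.toString(foo_array));
    }
}
```

Because every write goes to the slot named by the stream index, the arrays keep the original list order even when the stream runs in parallel.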

Source: https://stackoverflow.com/questions/41926920/java-8-extracting-a-pair-of-arrays-out-of-a-streampair

Tags

java, java-8, java-stream
