Compact stream expression for transposing double[][] Matrix

I want to transpose a double[][] matrix with the most compact and efficient expression possible. Right now I have this:

public static Function<double[][], double[][]> transpose() {
    return (m) -> {
        final int rows = m.length;
        final int columns = m[0].length;
        double[][] transpose = new double[columns][rows];
        range(0, rows).forEach(r -> {
            range(0, columns).forEach(c -> {
                transpose[c][r] = m[r][c];
            });
        });
        return transpose;
    };
}

Thoughts?

Tunaki

You could have:

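import java.util.function.UnaryOperator;
import static java.util.stream.IntStream.range;

public static UnaryOperator<double[][]> transpose() {
    return m -> {
        // r walks the columns of m, which become the rows of the result
        return range(0, m[0].length).mapToObj(r ->
            // c walks the rows of m, collecting column r into one result row
            range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
        ).toArray(double[][]::new);
    };
}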

This code does not use forEach; instead it uses mapToObj and mapToDouble to map each column index of the input to the corresponding row of the transposed matrix. I also changed Function<double[][], double[][]> to UnaryOperator<double[][]> since the argument type and the return type are the same.

However, it probably won't be more efficient than a simple for loop like the one in assylias's answer.

Sample code:

public static void main(String[] args) {
    double[][] m = { { 2, 3 }, { 1, 2 }, { -1, 1 } };
    double[][] tm = transpose().apply(m);
    System.out.println(Arrays.deepToString(tm)); // prints [[2.0, 1.0, -1.0], [3.0, 2.0, 1.0]]
}

I ran a JMH benchmark comparing the code above, the for loop version, and the code above run in parallel. All three methods are called with random square matrices of size 100, 1000 and 3000. For small matrices the for loop version is faster, but for bigger matrices the parallel Stream solution does perform better (Windows 10, JDK 1.8.0_66, i5-3230M @ 2.60 GHz):

Benchmark                           (matrixSize)  Mode  Cnt    Score    Error  Units
StreamTest.forLoopTranspose                  100  avgt   30    0,026 ±  0,001  ms/op
StreamTest.forLoopTranspose                 1000  avgt   30   14,653 ±  0,205  ms/op
StreamTest.forLoopTranspose                 3000  avgt   30  222,212 ± 11,449  ms/op
StreamTest.parallelStreamTranspose           100  avgt   30    0,113 ±  0,007  ms/op
StreamTest.parallelStreamTranspose          1000  avgt   30    7,960 ±  0,207  ms/op
StreamTest.parallelStreamTranspose          3000  avgt   30  122,587 ±  7,100  ms/op
StreamTest.streamTranspose                   100  avgt   30    0,040 ±  0,003  ms/op
StreamTest.streamTranspose                  1000  avgt   30   14,059 ±  0,444  ms/op
StreamTest.streamTranspose                  3000  avgt   30  216,741 ±  5,738  ms/op

Benchmark code:

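import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import org.openjdk.jmh.annotations.*;
import static java.util.stream.IntStream.range;

@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)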
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
public class StreamTest {

    private static final UnaryOperator<double[][]> streamTranspose() {
        return m -> {
            return range(0, m[0].length).mapToObj(r ->
                range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final UnaryOperator<double[][]> parallelStreamTranspose() {
        return m -> {
            return range(0, m[0].length).parallel().mapToObj(r ->
                range(0, m.length).parallel().mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final Function<double[][], double[][]> forLoopTranspose() {
        return m -> {
            final int rows = m.length;
            final int columns = m[0].length;
            double[][] transpose = new double[columns][rows];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < columns; c++)
                    transpose[c][r] = m[r][c];
            return transpose;
        };
    }

    @State(Scope.Benchmark)
    public static class MatrixContainer {

        @Param({ "100", "1000", "3000" })
        private int matrixSize;

        private double[][] matrix;

        @Setup(Level.Iteration)
        public void setUp() {
            ThreadLocalRandom random = ThreadLocalRandom.current();
            matrix = random.doubles(matrixSize).mapToObj(i -> random.doubles(matrixSize).toArray()).toArray(double[][]::new);
        }

    }

    @Benchmark
    public double[][] streamTranspose(MatrixContainer c) {
        return streamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] parallelStreamTranspose(MatrixContainer c) {
        return parallelStreamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] forLoopTranspose(MatrixContainer c) {
        return forLoopTranspose().apply(c.matrix);
    }

}

As compact and more efficient:

final int rows = m.length, cols = m[0].length;
double[][] transpose = new double[cols][rows];
for (int r = 0; r < rows; r++)
  for (int c = 0; c < cols; c++)
    transpose[c][r] = m[r][c];

Note that if you have a Matrix class that holds a double[][], an alternative would be to return a view that shares the same underlying array but swaps the row and column indices. You would save the cost of copying, but iteration may become slower because of worse cache locality.
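A minimal sketch of such a view, assuming a read-only wrapper around a non-empty rectangular array. The class name TransposedView and its methods are hypothetical and only illustrate the index swap:

```java
// Hypothetical read-only view, not part of any library.
final class TransposedView {
    // Shared with the original matrix, not copied.
    private final double[][] data;

    TransposedView(double[][] data) {
        this.data = data;
    }

    // Dimensions are swapped relative to the backing array.
    int rows() {
        return data[0].length;
    }

    int columns() {
        return data.length;
    }

    // Element (r, c) of the transpose is element (c, r) of the original.
    double get(int r, int c) {
        return data[c][r];
    }
}
```

Because the array is shared, later writes to the original matrix are visible through the view. Walking a row of the view walks a column of the backing array, which is where the cache locality penalty comes from.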

If you assume a rectangular input (as your original code seems to rely on), you could write it as

public static Function<double[][], double[][]> transpose() {
    return m -> range(0, m[0].length)
        .mapToObj(c->range(0, m.length).mapToDouble(r->m[r][c]).toArray())
        .toArray(double[][]::new);
}

This could run in parallel, but I suppose you'd need a damn big matrix to benefit from it.

My advice: for simple low-level math you should use plain old for loops instead of the Stream API. You should also benchmark such code very carefully.
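If the rectangular assumption cannot be taken for granted, it can be checked up front. This is only a sketch: the class RectangularTranspose and the helper isRectangular are hypothetical names, and rejecting empty or jagged input with an exception is one possible policy among several.

```java
import java.util.Arrays;
import static java.util.stream.IntStream.range;

final class RectangularTranspose {
    // Hypothetical helper: true if m is non-empty and every row
    // has the same length as the first row.
    static boolean isRectangular(double[][] m) {
        return m.length > 0
            && Arrays.stream(m).allMatch(row -> row.length == m[0].length);
    }

    // Same stream expression as above, guarded by the shape check.
    static double[][] transpose(double[][] m) {
        if (!isRectangular(m)) {
            throw new IllegalArgumentException("matrix must be non-empty and rectangular");
        }
        return range(0, m[0].length)
            .mapToObj(c -> range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
            .toArray(double[][]::new);
    }
}
```

The check costs one extra pass over the row references, which is small next to the copy itself.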

As for @Tunaki's benchmark: first, you should not limit a single measurement to 1 microsecond. The results for matrixSize = 100 are complete junk: 0,093 ± 0,054 and 0,237 ± 0,134, so the error is more than 50%. Note that the time measurement performed before and after each iteration is not magic and takes time too. Such a small interval can also easily be spoiled by some Windows service which suddenly wakes up, takes some CPU cycles to check something, then goes back to sleep. I usually set every warmup/measurement time to 500 ms; that number feels comfortable to me.

Second, when testing the Stream API with a very simple payload (such as copying numbers to a primitive array), you should always test with type profile pollution, as it really matters. In a clean benchmark the JIT compiler can inline everything into a single method, because it knows, for example, that after some range you always call the same mapToObj with the same lambda expression. In a real application that is not the case. I modified the MatrixContainer class this way:

@State(Scope.Benchmark)
public static class MatrixContainer {
    @Param({"true", "false"})
    private boolean pollute;

    @Param({ "100", "1000", "3000" })
    private int matrixSize;

    private double[][] matrix;

    @Setup(Level.Iteration)
    public void setUp() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        matrix = random.doubles(matrixSize)
                       .mapToObj(i -> random.doubles(matrixSize).toArray())
                       .toArray(double[][]::new);
        if(!pollute) return;
        // do some seemingly harmless operations which will
        // poison JIT compiler type profile with some other lambdas
        for(int i=0; i<100; i++) {
           range(0, 1000).map(x -> x+2).toArray();
           range(0, 1000).map(x -> x+5).toArray();
           range(0, 1000).mapToObj(x -> x*2).toArray();
           range(0, 1000).mapToObj(x -> x*3).toArray();
        }
    }
}

I also set 5 forks, because with the Stream API the JIT compiler may behave differently from run to run. Compilation happens in a background thread, and the profiling information available at the point of compilation may differ because of this race, which can change the compilation results significantly. So within a fork the results will be the same, but between forks they might be completely different.

My results are (Windows 7, Oracle JVM 8u45 64bit, some not-very-new i5-2410 laptop):

Benchmark                    (matrixSize)  (pollute)  Mode  Cnt    Score    Error  Units
StreamTest.forLoopTranspose           100       true  avgt   50    0,033 ±  0,001  ms/op
StreamTest.forLoopTranspose           100      false  avgt   50    0,032 ±  0,001  ms/op
StreamTest.forLoopTranspose          1000       true  avgt   50   17,094 ±  0,060  ms/op
StreamTest.forLoopTranspose          1000      false  avgt   50   17,065 ±  0,080  ms/op
StreamTest.forLoopTranspose          3000       true  avgt   50  260,173 ±  7,855  ms/op
StreamTest.forLoopTranspose          3000      false  avgt   50  258,774 ±  7,557  ms/op
StreamTest.streamTranspose            100       true  avgt   50    0,096 ±  0,001  ms/op
StreamTest.streamTranspose            100      false  avgt   50    0,055 ±  0,012  ms/op
StreamTest.streamTranspose           1000       true  avgt   50   21,497 ±  0,439  ms/op
StreamTest.streamTranspose           1000      false  avgt   50   15,883 ±  0,265  ms/op
StreamTest.streamTranspose           3000       true  avgt   50  272,806 ±  8,534  ms/op
StreamTest.streamTranspose           3000      false  avgt   50  260,515 ±  9,159  ms/op

Now the errors are much smaller, and you can see that type pollution makes the stream results worse while it does not affect the for-loop results. For matrices like 100x100 the difference is quite significant.

I'm adding an implementation example that includes the parallel switch. I'm curious what you all think of it.

/**
 * Returns a {@link UnaryOperator} that transposes the matrix.
 * 
 * Example {@code transpose(true).apply(m);}
 * 
 * @param parallel
 *            Whether to perform the transpose concurrently.
 */
public static UnaryOperator<ArrayMatrix> transpose(boolean parallel) {
    return (m) -> {
        double[][] data = m.getData();
        IntStream stream = range(0, m.getColumnDimension());
        stream = parallel ? stream.parallel() : stream;

        double[][] transpose =
                stream.mapToObj(
                        column -> range(0, data.length).mapToDouble(row -> data[row][column]).toArray())
                        .toArray(double[][]::new);
        return new ArrayMatrix(transpose);
    };
}
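Since ArrayMatrix is not shown in this thread, here is a rough sketch of the same parallel switch on a plain double[][], so it can be tried in isolation. The class name SwitchableTranspose is hypothetical, and a non-empty rectangular input is assumed.

```java
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

final class SwitchableTranspose {
    // Returns an operator that transposes a non-empty rectangular matrix,
    // optionally processing the columns of the input in parallel.
    static UnaryOperator<double[][]> transpose(boolean parallel) {
        return m -> {
            IntStream columns = IntStream.range(0, m[0].length);
            if (parallel) {
                columns = columns.parallel();
            }
            // toArray keeps encounter order, so the rows of the result
            // come out in column order even when run in parallel.
            return columns
                .mapToObj(c -> IntStream.range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
                .toArray(double[][]::new);
        };
    }
}
```

Both settings should produce the same result; whether the parallel one is faster depends on the matrix size, as the benchmarks above suggest.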

Source: https://stackoverflow.com/questions/34861469/compact-stream-expression-for-transposing-double-matrix
