I want to transpose a double[][] matrix with the most compact and efficient expression possible. Right now I have this:
public static Function<double[][], double[][]> transpose() {
    return (m) -> {
        final int rows = m.length;
        final int columns = m[0].length;
        double[][] transpose = new double[columns][rows];
        range(0, rows).forEach(r -> {
            range(0, columns).forEach(c -> {
                transpose[c][r] = m[r][c];
            });
        });
        return transpose;
    };
}
Thoughts?
You could have:
public static UnaryOperator<double[][]> transpose() {
    return m -> {
        return range(0, m[0].length).mapToObj(r ->
            range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
        ).toArray(double[][]::new);
    };
}
This code does not use forEach but prefers mapToObj and mapToDouble to build each row of the transpose. I also changed Function<double[][], double[][]> to UnaryOperator<double[][]> since the argument and return types are the same.
However, it probably won't be more efficient than a simple for loop like the one in assylias's answer.
Sample code:
public static void main(String[] args) {
    double[][] m = { { 2, 3 }, { 1, 2 }, { -1, 1 } };
    double[][] tm = transpose().apply(m);
    System.out.println(Arrays.deepToString(tm)); // prints [[2.0, 1.0, -1.0], [3.0, 2.0, 1.0]]
}
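On the switch from Function to UnaryOperator: UnaryOperator<T> extends Function<T, T>, so callers that expect a Function keep working. Here is a minimal sketch of that. The class name OperatorAsFunction and the helper transposeTwice are made up for illustration and are not part of the original code.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

final class OperatorAsFunction {

    // Same stream-based transpose as above, assuming a non-empty rectangular matrix.
    static UnaryOperator<double[][]> transpose() {
        return m -> IntStream.range(0, m[0].length)
                .mapToObj(c -> IntStream.range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
                .toArray(double[][]::new);
    }

    // Hypothetical helper: a UnaryOperator can be assigned to a Function variable
    // and composed with andThen, which is inherited from Function.
    static Function<double[][], double[][]> transposeTwice() {
        Function<double[][], double[][]> f = transpose();
        return f.andThen(transpose());
    }
}
```

Transposing twice should give back a matrix equal to the input, which makes for a quick sanity check.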
I ran a JMH benchmark comparing the code above, the for loop version, and the code above run in parallel. All three methods are called with random square matrices of size 100, 1000 and 3000. The results show that for small matrices the for loop version is faster, but with bigger matrices the parallel Stream solution does perform better (Windows 10, JDK 1.8.0_66, i5-3230M @ 2.60 GHz):
Benchmark (matrixSize) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 avgt 30 0,026 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 avgt 30 14,653 ± 0,205 ms/op
StreamTest.forLoopTranspose 3000 avgt 30 222,212 ± 11,449 ms/op
StreamTest.parallelStreamTranspose 100 avgt 30 0,113 ± 0,007 ms/op
StreamTest.parallelStreamTranspose 1000 avgt 30 7,960 ± 0,207 ms/op
StreamTest.parallelStreamTranspose 3000 avgt 30 122,587 ± 7,100 ms/op
StreamTest.streamTranspose 100 avgt 30 0,040 ± 0,003 ms/op
StreamTest.streamTranspose 1000 avgt 30 14,059 ± 0,444 ms/op
StreamTest.streamTranspose 3000 avgt 30 216,741 ± 5,738 ms/op
Benchmark code:
import static java.util.stream.IntStream.range;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
public class StreamTest {

    private static final UnaryOperator<double[][]> streamTranspose() {
        return m -> {
            return range(0, m[0].length).mapToObj(r ->
                range(0, m.length).mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final UnaryOperator<double[][]> parallelStreamTranspose() {
        return m -> {
            return range(0, m[0].length).parallel().mapToObj(r ->
                range(0, m.length).parallel().mapToDouble(c -> m[c][r]).toArray()
            ).toArray(double[][]::new);
        };
    }

    private static final Function<double[][], double[][]> forLoopTranspose() {
        return m -> {
            final int rows = m.length;
            final int columns = m[0].length;
            double[][] transpose = new double[columns][rows];
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < columns; c++)
                    transpose[c][r] = m[r][c];
            return transpose;
        };
    }

    @State(Scope.Benchmark)
    public static class MatrixContainer {

        @Param({ "100", "1000", "3000" })
        private int matrixSize;

        private double[][] matrix;

        // Build a fresh random square matrix before every iteration.
        @Setup(Level.Iteration)
        public void setUp() {
            ThreadLocalRandom random = ThreadLocalRandom.current();
            matrix = random.doubles(matrixSize).mapToObj(i -> random.doubles(matrixSize).toArray()).toArray(double[][]::new);
        }
    }

    @Benchmark
    public double[][] streamTranspose(MatrixContainer c) {
        return streamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] parallelStreamTranspose(MatrixContainer c) {
        return parallelStreamTranspose().apply(c.matrix);
    }

    @Benchmark
    public double[][] forLoopTranspose(MatrixContainer c) {
        return forLoopTranspose().apply(c.matrix);
    }
}
As compact and more efficient:
for (int r = 0; r < rows; r++)
    for (int c = 0; c < cols; c++)
        transpose[c][r] = m[r][c];
Note that if you have a Matrix class that holds a double[][], an alternative option would be to return a view that has the same underlying array but swaps the columns/rows indices. You would save on copying but you may get worse performance on iteration due to worse cache locality.
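A minimal sketch of the view idea, assuming a small read-only interface. The names MatrixView, ArrayView and TransposedView are hypothetical and only serve the illustration; a real Matrix class would likely expose more operations.

```java
// Hypothetical read-only matrix abstraction, invented for this sketch.
interface MatrixView {
    int rows();
    int columns();
    double get(int row, int column);
}

// Wraps a rectangular double[][] without copying it.
final class ArrayView implements MatrixView {
    private final double[][] data; // shared with the caller, not copied

    ArrayView(double[][] data) { this.data = data; }

    public int rows() { return data.length; }
    public int columns() { return data.length == 0 ? 0 : data[0].length; }
    public double get(int row, int column) { return data[row][column]; }
}

// Transposed view: no data is moved, only the indices are swapped on access.
final class TransposedView implements MatrixView {
    private final MatrixView source;

    TransposedView(MatrixView source) { this.source = source; }

    public int rows() { return source.columns(); }
    public int columns() { return source.rows(); }
    public double get(int row, int column) { return source.get(column, row); }
}
```

Creating the view is O(1), and writes to the underlying array are visible through it. The trade-off is the one mentioned above: iterating the view row by row walks the backing array column by column.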
If you assume a rectangular input (as your original code seems to rely on), you could write it as
public static Function<double[][], double[][]> transpose() {
    return m -> range(0, m[0].length)
        .mapToObj(c -> range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
        .toArray(double[][]::new);
}
This could run in parallel, but I suppose you'd need a damn big matrix to benefit from it.
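If the rectangular assumption should be checked rather than trusted, a guard could look like the sketch below. The class RectangularTranspose and the helper isRectangular are hypothetical names, and rejecting an empty matrix is a choice made here for simplicity, not something the original code specifies.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class RectangularTranspose {

    // Hypothetical helper: true when the matrix is non-empty and
    // every row has the same length as the first row.
    static boolean isRectangular(double[][] m) {
        return m.length > 0 && Arrays.stream(m).allMatch(row -> row.length == m[0].length);
    }

    // Same stream expression as above, preceded by the shape check.
    static double[][] transpose(double[][] m) {
        if (!isRectangular(m)) {
            throw new IllegalArgumentException("matrix must be non-empty and rectangular");
        }
        return IntStream.range(0, m[0].length)
                .mapToObj(c -> IntStream.range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
                .toArray(double[][]::new);
    }
}
```

Without such a check, a jagged input either throws ArrayIndexOutOfBoundsException somewhere in the middle of the stream or silently drops the extra elements of longer rows, depending on which row is short.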
My advice: for simple low-level math you should use plain old for loops instead of the Stream API. You should also benchmark such code very carefully.
As for @Tunaki's benchmark: first, you should not limit a single measurement to 1 microsecond. The results for matrixSize = 100 are complete junk: 0,093 ± 0,054 and 0,237 ± 0,134, so the error is more than 50%. Note that the time measurement performed before and after each iteration is not magic and takes time too. Such a small interval can easily be spoiled by some Windows service that suddenly wakes up, takes some CPU cycles to check something, then goes back to sleep. I usually set every warmup/measurement time to 500 ms; this number looks comfortable to me.
Second, when testing the Stream API with a very simple payload (such as copying numbers to a primitive array), you should always test with type profile pollution, as it really matters. In a clean benchmark the JIT compiler can inline everything into a single method, because it knows, for example, that after some range you always call the same mapToObj with the same lambda expression. In a real application that is not the case. I modified the MatrixContainer class this way:
@State(Scope.Benchmark)
public static class MatrixContainer {

    @Param({"true", "false"})
    private boolean pollute;

    @Param({ "100", "1000", "3000" })
    private int matrixSize;

    private double[][] matrix;

    @Setup(Level.Iteration)
    public void setUp() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        matrix = random.doubles(matrixSize)
                       .mapToObj(i -> random.doubles(matrixSize).toArray())
                       .toArray(double[][]::new);
        if(!pollute) return;
        // do some seemingly harmless operations which will
        // poison JIT compiler type profile with some other lambdas
        for(int i=0; i<100; i++) {
            range(0, 1000).map(x -> x+2).toArray();
            range(0, 1000).map(x -> x+5).toArray();
            range(0, 1000).mapToObj(x -> x*2).toArray();
            range(0, 1000).mapToObj(x -> x*3).toArray();
        }
    }
}
I also set 5 forks, because with the Stream API the JIT compiler may behave differently from run to run. Compilation happens in a background thread, and the profiling information available at the point of compilation may differ because of a race, which can change the compilation results significantly. So within a fork the results will be the same, but between forks they might be completely different.
My results are (Windows 7, Oracle JVM 8u45 64bit, some not-very-new i5-2410 laptop):
Benchmark (matrixSize) (pollute) Mode Cnt Score Error Units
StreamTest.forLoopTranspose 100 true avgt 50 0,033 ± 0,001 ms/op
StreamTest.forLoopTranspose 100 false avgt 50 0,032 ± 0,001 ms/op
StreamTest.forLoopTranspose 1000 true avgt 50 17,094 ± 0,060 ms/op
StreamTest.forLoopTranspose 1000 false avgt 50 17,065 ± 0,080 ms/op
StreamTest.forLoopTranspose 3000 true avgt 50 260,173 ± 7,855 ms/op
StreamTest.forLoopTranspose 3000 false avgt 50 258,774 ± 7,557 ms/op
StreamTest.streamTranspose 100 true avgt 50 0,096 ± 0,001 ms/op
StreamTest.streamTranspose 100 false avgt 50 0,055 ± 0,012 ms/op
StreamTest.streamTranspose 1000 true avgt 50 21,497 ± 0,439 ms/op
StreamTest.streamTranspose 1000 false avgt 50 15,883 ± 0,265 ms/op
StreamTest.streamTranspose 3000 true avgt 50 272,806 ± 8,534 ms/op
StreamTest.streamTranspose 3000 false avgt 50 260,515 ± 9,159 ms/op
Now the errors are much smaller, and you can see that type pollution makes the stream results worse while it does not affect the for-loop results. For matrices like 100x100 the difference is quite significant.
I'm adding an implementation example that includes the parallel switch. I'm curious what you all think of it.
/**
 * Returns a {@link UnaryOperator} that transposes the matrix.
 *
 * Example {@code transpose(true).apply(m);}
 *
 * @param parallel
 *            Whether to perform the transpose concurrently.
 */
public static UnaryOperator<ArrayMatrix> transpose(boolean parallel) {
    return (m) -> {
        double[][] data = m.getData();
        IntStream stream = range(0, m.getColumnDimension());
        stream = parallel ? stream.parallel() : stream;
        double[][] transpose =
            stream.mapToObj(
                column -> range(0, data.length).mapToDouble(row -> data[row][column]).toArray())
            .toArray(double[][]::new);
        return new ArrayMatrix(transpose);
    };
}
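For readers without the ArrayMatrix class (which is not shown in this thread), the same switch can be sketched on a plain double[][]. The class name TransposeSwitch is hypothetical, and the sketch assumes a non-empty rectangular input.

```java
import java.util.function.UnaryOperator;
import java.util.stream.IntStream;

final class TransposeSwitch {

    // Returns a transposing operator; only the outer (column) stream is
    // switched to parallel, the inner per-row copy stays sequential.
    static UnaryOperator<double[][]> transpose(boolean parallel) {
        return m -> {
            IntStream columns = IntStream.range(0, m[0].length);
            if (parallel) {
                columns = columns.parallel();
            }
            return columns
                    .mapToObj(c -> IntStream.range(0, m.length).mapToDouble(r -> m[r][c]).toArray())
                    .toArray(double[][]::new);
        };
    }
}
```

Because toArray on an ordered stream keeps encounter order, the parallel and sequential variants should produce the same array; only the timing differs, and per the benchmarks above the parallel variant only pays off for large matrices.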
Source: https://stackoverflow.com/questions/34861469/compact-stream-expression-for-transposing-double-matrix