Why doesn't the bounds check get eliminated?

  1. No, this is evidently an effect of insufficiently smart bounds check elimination.

I've extended a benchmark by Marko Topolnik:

import java.lang.reflect.Field;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import sun.misc.Unsafe;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(BCElimination.N)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class BCElimination {
    public static final int N = 1024;
    private static final Unsafe U;
    private static final long INT_BASE;
    private static final long INT_SCALE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }

        INT_BASE = U.arrayBaseOffset(int[].class);
        INT_SCALE = U.arrayIndexScale(int[].class);
    }

    private final int[] table = new int[BCElimination.N];

    @Setup public void setUp() {
        final Random random = new Random();
        for (int i=0; i<table.length; ++i) table[i] = random.nextInt();
    }

    @GenerateMicroBenchmark public int normalIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= table[i] + j;
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndex() {
        int result = 0;
        final int[] table = this.table;
        int x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i;
            final int j = x & (table.length-1);
            result ^= i + table[j];
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndexUnsafe() {
        int result = 0;
        final int[] table = this.table;
        long x = 0;
        for (int i=0; i<=table.length-1; ++i) {
            x += i * INT_SCALE;
            final long j = x & ((table.length-1) * INT_SCALE);
            result ^= i + U.getInt(table, INT_BASE + j);
        }
        return result;
    }
}
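
A minimal launcher for this benchmark could look like the following (a sketch, not part of the original benchmark; it assumes a JMH version that provides the Runner/OptionsBuilder API — in current JMH releases the @GenerateMicroBenchmark annotation is spelled @Benchmark — and the class name BCEliminationRunner is made up):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BCEliminationRunner {
    public static void main(String[] args) throws RunnerException {
        // Select the BCElimination benchmarks by name; the fork, warmup and
        // measurement settings come from the annotations on the class itself.
        Options opt = new OptionsBuilder()
                .include(BCElimination.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}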

Results:

Benchmark                                Mean   Mean error    Units
BCElimination.maskedIndex               1.235        0.004    ns/op
BCElimination.maskedIndexUnsafe         1.092        0.007    ns/op
BCElimination.normalIndex               1.071        0.008    ns/op


2. The second question is better suited to the hotspot-dev mailing list than to Stack Overflow, IMHO.

To start off, the main difference between your two tests is definitely in bounds check elimination; however, the way this influences the machine code is far from what the naïve expectation would suggest.

My conjecture:

The bounds check matters not so much as additional code that introduces overhead, but as a loop exit point.

The loop exit point prevents the following optimization which I have culled from the emitted machine code:

  • the loop is unrolled (this is true in all cases);
  • additionally, the array-fetch stage is done first for all unrolled steps, and only then is the XOR-into-the-accumulator stage done for all of them.

If the loop can break out at any step, this staging would result in work performed for loop steps which were never actually taken.
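
To make the staging concrete, here is a source-level sketch of the shape the unrolled normalIndex loop body takes when no early exit is possible (my approximation, not actual JIT output; the unroll factor of 4 and the helper name stagedNormalIndex are assumptions):

// Source-level approximation of the unrolled, staged loop: all array loads are
// issued first, then the accumulator updates follow. Assumes table.length is a
// positive multiple of 4 (as with N = 1024 above). An early exit inside the body
// would make the up-front loads for later steps speculative, blocking this shape.
static int stagedNormalIndex(int[] table) {
    int result = 0;
    int x = 0;
    for (int i = 0; i < table.length; i += 4) {
        // stage 1: fetch from the array for all four unrolled steps
        final int e0 = table[i];
        final int e1 = table[i + 1];
        final int e2 = table[i + 2];
        final int e3 = table[i + 3];
        // the index computation for each step (x accumulates i, i+1, i+2, i+3)
        final int j0 = (x += i)     & (table.length - 1);
        final int j1 = (x += i + 1) & (table.length - 1);
        final int j2 = (x += i + 2) & (table.length - 1);
        final int j3 = (x += i + 3) & (table.length - 1);
        // stage 2: only now fold everything into the accumulator
        result ^= e0 + j0;
        result ^= e1 + j1;
        result ^= e2 + j2;
        result ^= e3 + j3;
    }
    return result;
}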

Consider this slight modification of your code:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.N)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(1)
public class Measure {
  public static final int N = 1024;

  private final int[] table = new int[N];
  @Setup public void setUp() {
    final Random random = new Random();
    for (int i = 0; i < table.length; ++i) {
      final int x = random.nextInt();
      table[i] = x == 0? 1 : x;
    }
  }
  @GenerateMicroBenchmark public int normalIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[i];
      result ^= entry + j;
      if (entry == 0) break;
    }
    return result;
  }
  @GenerateMicroBenchmark public int maskedIndex() {
    int result = 0;
    final int[] table = this.table;
    int x = 0;
    for (int i = 0; i <= table.length - 1; ++i) {
      x += i;
      final int j = x & (table.length - 1);
      final int entry = table[j];
      result ^= i + entry;
      if (entry == 0) break;
    }
    return result;
  }
}

There is just one difference: I have added the check

if (entry == 0) break;

to give the loop a way to exit prematurely on any step. (I also introduced a guard to ensure no array entries are actually 0.)

On my machine, this is the result:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.378        0.229    ns/op
o.s.Measure.normalIndex     avgt         5        0.924        0.092    ns/op

the "normal index" variant is substantially faster, as generally expected.

However, let us remove the additional check:

// if (entry == 0) break;

Now my results are these:

Benchmark                   Mode   Samples         Mean   Mean error    Units
o.s.Measure.maskedIndex     avgt         5        1.130        0.065    ns/op
o.s.Measure.normalIndex     avgt         5        1.229        0.053    ns/op

"Masked index" responded predictably (reduced overhead), but "normal index" is suddenly much worse. This is apparently due to a bad fit between the additional optimization step and my specific CPU model.

My point:

The performance model at such a detailed level is very unstable and, as witnessed on my CPU, even erratic.

In order to safely eliminate that bounds check, it is necessary to prove that

h & (table.length - 1)

is guaranteed to produce a valid index into table. It won't if table.length is zero: the mask becomes -1, so the & is effectively a no-op, and an empty array has no valid index at all. It also won't do anything useful if table.length is not a power of 2: you lose information (consider the case where table.length is 17, which gives a mask of 16).
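
A tiny stand-alone demo of those two cases (my illustration, not part of the original answer; the class name MaskDemo is made up):

public class MaskDemo {
    public static void main(String[] args) {
        int h = 1234567;

        int[] empty = new int[0];
        int emptyMask = empty.length - 1;        // -1: all bits set
        System.out.println(h & emptyMask);       // prints h itself; the mask proves nothing,
                                                 // and an empty array has no valid index anyway

        int[] seventeen = new int[17];
        int oddMask = seventeen.length - 1;      // 16, binary 10000
        System.out.println(h & oddMask);         // only ever 0 or 16: in bounds, but almost
                                                 // all bits of h are thrown away
    }
}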

How can the HotSpot compiler know that these bad conditions never occur? It has to be more conservative than the programmer can be, because the programmer can know more about the high-level constraints of the system (e.g., that the array is never empty and always has a number of elements that is a power of two).
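
For completeness, this is the kind of invariant a programmer typically enforces at construction time so that the masked access is safe by design (a sketch of mine, not from the original answer; the method name newPowerOfTwoTable is made up):

// Round the requested capacity up to a power of two before allocating, so that
// (h & (table.length - 1)) is always a valid index by the programmer's reasoning,
// even though the JIT compiler cannot see this invariant on its own.
// (The caller is expected to keep the capacity at or below 2^30 to avoid overflow.)
static int[] newPowerOfTwoTable(int requestedCapacity) {
    int c = Math.max(1, requestedCapacity);
    int highest = Integer.highestOneBit(c);      // largest power of two <= c
    int size = (highest == c) ? c : highest << 1;
    return new int[size];
}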
