How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?

别来无恙 submitted on 2019-12-02 20:36:43

It's likely that the two volatile variables a and b are so close to each other in memory that they fall in the same cache line; although CPU A only reads/writes variable a, and CPU B only reads/writes variable b, the two are still coupled to each other through that shared cache line. This problem is called false sharing.

In your example, we have two allocation schemes:

new Thread                               new Thread
new Container               vs           new Thread
new Thread                               ....
new Container                            new Container
....                                     new Container

In the first scheme, it's very unlikely that two volatile variables end up close to each other. In the second scheme, it's almost certainly the case.
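The two allocation orders can be sketched like this (class and method names are illustrative, following the question's Container; the actual heap layout is JVM- and GC-dependent, so this only shows the allocation order, not a guaranteed layout):

```java
// Illustrative sketch of the two allocation orders. Actual object placement
// is up to the JVM and garbage collector; this only shows allocation order.
public class AllocationOrders {
    static class Container {
        volatile long value; // each thread touches only its own Container
    }

    // Scheme 1: interleaved -- each Container is separated from the next
    // by a Thread object, so two Containers rarely share a cache line.
    static Container[] interleaved(int n) {
        Thread[] threads = new Thread[n];
        Container[] containers = new Container[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread();
            containers[i] = new Container();
        }
        return containers;
    }

    // Scheme 2: grouped -- Containers are allocated back to back, so
    // neighboring Containers very likely end up in the same cache line.
    static Container[] grouped(int n) {
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) threads[i] = new Thread();
        Container[] containers = new Container[n];
        for (int i = 0; i < n; i++) containers[i] = new Container();
        return containers;
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + interleaved(4).length
                + ", grouped: " + grouped(4).length);
    }
}
```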

CPU caches don't work with individual words; instead, they deal with cache lines. A cache line is a contiguous chunk of memory, say 64 neighboring bytes. Usually this is nice: if a CPU accesses a memory cell, it's very likely to access the neighboring cells too. Except in your example, that assumption is not only invalid, but detrimental.

Suppose a and b fall in the same cache line L. When CPU A updates a, it notifies the other CPUs that L is dirty. Since B caches L too (because it's working on b), B must drop its cached copy of L. So the next time B needs to read b, it must reload L, which is costly.

If B must access main memory to reload, that is extremely costly: usually around 100X slower than a cache hit.

Fortunately, A and B can communicate directly about the new values without going through main memory. Nevertheless it takes extra time.

To verify this theory, you can stuff an extra 128 bytes of padding into Container, so that the volatile variables of two Containers never fall in the same cache line; you should then observe that the two schemes take about the same time to execute.
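A padded Container might look like the sketch below. The field names are arbitrary, and note the caveat: the JVM does not guarantee field ordering, so manual padding is a heuristic rather than a contract (recent JDKs offer a JDK-internal @Contended annotation for the same purpose).

```java
// Sketch: 128 bytes of long fields on each side of the hot volatile field,
// so that two adjacent PaddedContainer instances are very unlikely to place
// their volatile fields in the same 64-byte cache line.
// Caveat: the JVM may reorder fields; this is a heuristic, not a guarantee.
public class PaddedContainer {
    long p00, p01, p02, p03, p04, p05, p06, p07;
    long p08, p09, p10, p11, p12, p13, p14, p15; // 16 longs = 128 bytes before

    volatile long value; // the hot field each thread reads/writes

    long q00, q01, q02, q03, q04, q05, q06, q07;
    long q08, q09, q10, q11, q12, q13, q14, q15; // 128 bytes after

    public static void main(String[] args) {
        PaddedContainer c = new PaddedContainer();
        c.value = 42;
        System.out.println("value=" + c.value);
    }
}
```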

Lesson learned: usually CPUs assume that adjacent variables are related. If we want independent variables, we had better place them far away from each other.

Well, you're writing to a volatile variable, so I suspect that's forcing a memory barrier, preventing some optimizations which could otherwise be applied. The JVM doesn't know that that particular field isn't going to be observed by another thread.

EDIT: As noted, there are problems with the benchmark itself, such as printing while the timer is running. Also, it's usually a good idea to "warm up" the JIT before starting timing; otherwise you're measuring costs that wouldn't be significant in a normal long-running process.
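A minimal warm-up pattern looks like this (work() is a hypothetical stand-in for the real workload; for serious measurements a harness such as JMH handles warm-up, dead-code elimination, and statistics for you):

```java
// Minimal benchmark warm-up sketch: run the measured code untimed first so
// the JIT has compiled the hot path before the clock starts.
public class WarmupDemo {
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20_000; i++) work(1_000); // warm-up, results discarded
        long start = System.nanoTime();
        long result = work(1_000_000);                // the measured run
        long elapsedNs = System.nanoTime() - start;
        System.out.println("result=" + result + ", elapsedNs=" + elapsedNs);
    }
}
```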

I am not an expert in the internals of Java, but I read your question and find it fascinating. If I had to guess, I think what you have discovered:

  1. Does NOT have anything to do with the instantiation of the volatile property itself. However, from your data, where the property gets instantiated affects how expensive it is to read/write it.

  2. Does have to do with finding the reference to the volatile property at runtime. That is, I would be interested to see how the delay grows with more threads that loop more often. Is it the number of accesses to the volatile property that causes the delay, the addition itself, or the writing of the new value?

I would have to guess that there is probably a JVM optimization that attempts to instantiate the property quickly, and later, if there is time, to rearrange the property in memory so it is cheaper to read/write. Maybe there is a (1) quick-to-create read/write path for volatile properties, and a (2) hard-to-create but quick-to-access path, and the JVM begins with (1) and later migrates the volatile property to (2).

Perhaps if you prepare() right before the run() method gets called, the JVM does not have enough free cycles to optimize from (1) to (2).

The way to test this answer would be to:

prepare(), sleep(), then run(), and see if you get the same delay. If the sleep alone is what allows the optimization to take place, it would mean the JVM needs idle cycles to optimize the volatile property

OR (a bit more risky) ...

prepare() and run() the threads, then later, in the middle of the loop, pause() or sleep() or somehow stop access to the property in a way that lets the JVM attempt to move it to (2).
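The first experiment might be sketched as follows. Note that prepare() and run() here are placeholders for the question's methods (which are not shown in this answer), and the sleep length is arbitrary:

```java
// Hypothetical sketch of the "prepare, sleep, run" experiment.
// prepare() and run() stand in for the question's methods.
public class SleepExperiment {
    static volatile long field;

    static void prepare() {          // stand-in: allocate/initialize the property
        field = 0;
    }

    static void run(int iterations) { // stand-in: the timed loop
        for (int i = 0; i < iterations; i++) field += i;
    }

    public static void main(String[] args) throws InterruptedException {
        prepare();
        Thread.sleep(1000); // give the JVM idle cycles between prepare and run
        long start = System.nanoTime();
        run(1_000_000);
        System.out.println("elapsedNs=" + (System.nanoTime() - start));
    }
}
```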

I'd be interested to see what you find out. Interesting question.

Well, the big difference I see is in the order in which objects are allocated. When preparing after the constructor, your Container allocations are interleaved with your Thread allocations. When preparing prior to execution, your Threads are all allocated first, then your Containers are all allocated.

I don't know a whole lot about memory issues in multi-processor environments, but if I had to guess, maybe in the second case the Container allocations are more likely to be allocated in the same memory page, and perhaps the processors are slowed down due to contention for the same memory page?

[edit] Following this line of thought, I'd be interested to see what happens if you don't try to write back to the variable, and only read from it, in the Thread's run method. I would expect the timings difference to go away.

[edit2] See irreputable's answer; he explains it much better than I could
