Fastest x86 assembly code to synchronize access to an array? [closed]

Submitted by Anonymous (unverified) on 2019-12-03 02:56:01

Question:

What is the fastest x86 assembly code to synchronize access to an array in memory?

To be more precise: we have a malloc'ed contiguous region of memory that fits in a single page, and the OS will not page out this region for the duration of our experiment. One thread will write to the array, one thread will read from the array. The array is small, but larger than the atomic-write capability of your CPU (so that a separate lock is actually required).

"fastest": the effective speed: Do not just assume the length of bytecode is significant but take into account the caching behavior of the lock and branching behavior regarding surrounding code.

It has to work on x86-32 and/or x86-64

It has to work on top of (or descendants of) Windows since XP, Linux since kernel 2.2, or Mac OS X (in user mode).

Please no "it depends" answers: if it depends on something I have not specified here, just make up your own example(s) and state what is fastest in that case / those cases.

Post code! (This is to prevent vague descriptions)

Post not just your two-line LOCK CMPXCHG compare-and-swap, but show us how you integrate it with the read instructions in the one thread and the write instructions in the other.

If you like, explain your tweaks for cache optimality and how you avoid branch mispredictions when the branch target depends on (1) whether you get the lock or not, and (2) what the first byte of a larger read turns out to be.

If you like, distinguish between multiprocessing and task switching: how will your code perform if the two threads do not run on two CPUs but have to share one?
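For concreteness, here is a minimal sketch of the kind of skeleton meant, assuming C11 atomics compiled with GCC or Clang (the compare-and-swap is what the compiler emits as lock cmpxchg on x86). All names and loop counts are made up, and this is an untuned baseline, not a proposed answer:

    /* spinlock built on compare-and-swap, guarding a small shared array;
       one writer thread, one reader thread */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define N 64                      /* small array, larger than one atomic store */

    static atomic_int lock_word;      /* 0 = free, 1 = held */
    static unsigned char buf[N];      /* the shared array */

    static void lock_acquire(void)
    {
        for (;;) {
            int expected = 0;
            /* the locked read-modify-write: lock cmpxchg on x86 */
            if (atomic_compare_exchange_weak_explicit(&lock_word, &expected, 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
                return;
            /* spin on a plain load so the locked RMW is only retried once
               the lock looks free (test-and-test-and-set) */
            while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
                ;
        }
    }

    static void lock_release(void)
    {
        /* on x86 an ordinary store with release ordering is enough to unlock */
        atomic_store_explicit(&lock_word, 0, memory_order_release);
    }

    static void *writer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            lock_acquire();
            for (int j = 0; j < N; j++)       /* whole-array update under the lock */
                buf[j] = (unsigned char)i;
            lock_release();
        }
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        unsigned long sum = 0;
        for (int i = 0; i < 1000000; i++) {
            lock_acquire();
            for (int j = 0; j < N; j++)       /* whole-array read under the lock */
                sum += buf[j];
            lock_release();
        }
        printf("checksum: %lu\n", sum);
        return NULL;
    }

    int main(void)                            /* build: cc -O2 -pthread spin.c */
    {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }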

Answer 1:

Really, the answer is "it depends". What is the usage pattern of your array? Is it read-mostly? Is it update-mostly, such that you could get away with imprecise results when reading (using per-CPU arrays)? Are updates so infrequent that RCU would give a serious performance improvement?

There are lots of trade-offs here; see Paul McKenney's book: Is Parallel Programming Hard, And, If So, What Can You Do About It?
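To make the read-mostly option concrete, here is a bare-bones sketch of RCU-style pointer publication, assuming C11 atomics. It is only an illustration of the idea: it deliberately omits the grace-period/reclamation machinery that real RCU exists to provide (old copies are simply leaked), and all names are invented:

    /* RCU-flavoured read-mostly sketch: the writer rebuilds the array and
       publishes it with a single atomic pointer store; readers take no lock */
    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    #define N 64

    static _Atomic(unsigned char *) cur;      /* currently published snapshot */

    void array_init(void)
    {
        atomic_store_explicit(&cur, calloc(N, 1), memory_order_release);
    }

    /* writer: copy-on-write update, then publish with one pointer store */
    void array_update(int idx, unsigned char val)
    {
        unsigned char *old = atomic_load_explicit(&cur, memory_order_acquire);
        unsigned char *fresh = malloc(N);
        memcpy(fresh, old, N);
        fresh[idx] = val;
        atomic_store_explicit(&cur, fresh, memory_order_release);
        /* real RCU: wait for a grace period, then free(old) */
    }

    /* reader: one acquire load of the pointer, then plain reads - no
       locked instructions on the read path at all */
    unsigned long array_sum(void)
    {
        const unsigned char *p = atomic_load_explicit(&cur, memory_order_acquire);
        unsigned long s = 0;
        for (int i = 0; i < N; i++)
            s += p[i];
        return s;
    }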



Answer 2:

I don't get it. Bus locking (the lock prefix, or an xchg mem,reg instruction, which is locked implicitly) and speed have little to do with each other. It is about physically synchronizing the CPU with the slowest active device in your system - which might sit on a 33 MHz PCI bus or some such - and you can bet that will be much slower than a RAM access that misses the cache. So expect 300-3000 CPU clock cycles, depending on how long you have to wait for the device. If no devices are active, you still need to wait for the respective buses to acknowledge the lock.
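If you want a number for your own machine rather than a guess, the uncontended cost is easy to measure. A rough sketch, assuming GCC or Clang on x86 (__rdtsc comes from x86intrin.h); it says nothing about contention or about traffic from other devices:

    /* rough, uncontended comparison of a bus-locked RMW vs a plain RMW */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void)
    {
        enum { ITERS = 1000000 };
        atomic_int locked_counter = 0;
        volatile int plain_counter = 0;

        uint64_t t0 = __rdtsc();
        for (int i = 0; i < ITERS; i++)
            plain_counter = plain_counter + 1;          /* ordinary add */
        uint64_t t1 = __rdtsc();
        for (int i = 0; i < ITERS; i++)                 /* lock add / lock xadd */
            atomic_fetch_add_explicit(&locked_counter, 1, memory_order_relaxed);
        uint64_t t2 = __rdtsc();

        printf("plain add : %.1f cycles/op\n", (double)(t1 - t0) / ITERS);
        printf("locked add: %.1f cycles/op\n", (double)(t2 - t1) / ITERS);
        return 0;
    }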

Fastest code? Forget it. You need to either accept that this is how bus locks work, or find other ways to synchronize that do not require bus locking.
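One such alternative for the single-writer/single-reader scenario in the question is a sequence lock: the writer brackets its update with increments of a counter, and the reader retries whenever it may have seen a torn write. A minimal sketch, assuming C11 atomics and invented names (the usual C11 seqlock formulation); on x86 none of these operations needs a locked instruction, they all compile to ordinary mov, and the fences only constrain the compiler:

    /* sequence lock for one writer and any number of readers: no lock prefix
       on either side */
    #include <stdatomic.h>

    #define N 64

    static atomic_uint seq;                    /* even = stable, odd = write in progress */
    static _Atomic unsigned char data[N];      /* the shared array */

    /* single writer: make the counter odd, update, make it even again */
    void write_array(const unsigned char *src)
    {
        unsigned s = atomic_load_explicit(&seq, memory_order_relaxed);
        atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
        for (int i = 0; i < N; i++)
            atomic_store_explicit(&data[i], src[i], memory_order_relaxed);
        atomic_store_explicit(&seq, s + 2, memory_order_release);
    }

    /* reader: retry until it saw an even, unchanged counter around its copy */
    void read_array(unsigned char *dst)
    {
        unsigned s0, s1;
        do {
            s0 = atomic_load_explicit(&seq, memory_order_acquire);
            for (int i = 0; i < N; i++)
                dst[i] = atomic_load_explicit(&data[i], memory_order_relaxed);
            atomic_thread_fence(memory_order_acquire);
            s1 = atomic_load_explicit(&seq, memory_order_relaxed);
        } while ((s0 & 1u) || s0 != s1);
    }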



Answer 3:

If locking performance is important, you're doing something wrong.


