cpu-architecture

x86-64 usage of LFENCE

北城余情 submitted on 2019-12-01 02:20:44
Question: I'm trying to understand the right way to use fences when measuring time with RDTSC/RDTSCP. Several related questions on SO have already been answered in detail, and I have gone through a few of them. I have also read this really helpful article on the same topic: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf However, another online blog gives an example that uses LFENCE instead of CPUID on x86. I was
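
As a rough sketch (not taken from the post itself), the LFENCE-based measurement pattern looks something like the following in C with GCC/Clang x86 intrinsics; the helper name time_region is made up for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

    /* Time a region with LFENCE-serialized RDTSC/RDTSCP: the leading LFENCE
     * keeps earlier work out of the timed window, and RDTSCP plus a trailing
     * LFENCE keeps later work from starting before the stop timestamp. */
    static uint64_t time_region(void)
    {
        unsigned aux;
        _mm_lfence();
        uint64_t start = __rdtsc();
        /* ... code under test ... */
        uint64_t end = __rdtscp(&aux);
        _mm_lfence();
        return end - start;
    }

    int main(void)
    {
        printf("elapsed cycles: %llu\n", (unsigned long long)time_region());
        return 0;
    }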

How is load->store reordering possible with in-order commit?

允我心安 submitted on 2019-12-01 02:17:06
Question: ARM allows reordering loads with subsequent stores, so that the following pseudocode: // CPU 0 | // CPU 1 temp0 = x; | temp1 = y; y = 1; | x = 1; can result in temp0 == temp1 == 1 (and this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (and my understanding was that in-order commit is present in pretty much all OoO processors). My reasoning goes: "the load must have its value before it commits, it commits before the
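
A minimal litmus-test sketch of this in C11 (my own illustration, using relaxed atomics so neither the compiler nor an ARM CPU is forced to keep each load ahead of the following store; actually seeing r0 == r1 == 1 may take many runs):

    /* build: cc -O2 -pthread litmus.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Load-buffering litmus test: with relaxed ordering, the load in each
     * thread may be reordered after its store, so r0 == r1 == 1 is allowed. */
    static atomic_int x, y;
    static int r0, r1;

    static void *cpu0(void *arg) {
        (void)arg;
        r0 = atomic_load_explicit(&x, memory_order_relaxed);  /* temp0 = x */
        atomic_store_explicit(&y, 1, memory_order_relaxed);   /* y = 1     */
        return NULL;
    }

    static void *cpu1(void *arg) {
        (void)arg;
        r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* temp1 = y */
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* x = 1     */
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 1000000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            pthread_t t0, t1;
            pthread_create(&t0, NULL, cpu0, NULL);
            pthread_create(&t1, NULL, cpu1, NULL);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            if (r0 == 1 && r1 == 1)
                printf("reordering observed at iteration %d\n", i);
        }
        return 0;
    }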

How does mtune actually work?

五迷三道 submitted on 2019-11-30 23:50:30
Question: There's this related question: GCC: how is march different from mtune? However, the existing answers don't go much further than the GCC manual itself. At most, we get: If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated. and The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on. But exactly how does GCC favor one specific
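
One hands-on way to see what -mtune actually changes (my own sketch, not from the question): compile the same translation unit with two tuning targets and diff the assembly. The file and function names below are made up, and generic/skylake are just example tuning values your GCC may or may not support:

    /* tune_demo.c -- compare, e.g.:
     *   gcc -O2 -S -march=x86-64 -mtune=generic  tune_demo.c -o generic.s
     *   gcc -O2 -S -march=x86-64 -mtune=skylake  tune_demo.c -o skylake.s
     *   diff generic.s skylake.s
     * Differences typically show up in instruction selection, unrolling and
     * alignment padding, not in which ISA extensions are used. */
    #include <stddef.h>

    void copy_bytes(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }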

Why is %r0 of SPARC or MIPS always 0?

丶灬走出姿态 submitted on 2019-11-30 23:33:16
I know that reading %r0 on a SPARC CPU (and on MIPS) always returns 0, but I would like to know why. What design decision is behind this? It's just the way the CPU was designed. Ensuring that r0 always reads as zero is, for one thing, a way to avoid potentially costly memory accesses for a very common value. On the one hand (reading), it's handy to have a register set aside to hold the value zero so that you can use it; otherwise, you would have to load zero into a register yourself. Many RISC processors tend to favour data manipulation in registers, accessing memory only for load and
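
To make that concrete, here is a sketch of my own (not from the answer; exact output depends on compiler and options) showing the kind of MIPS code a compiler typically emits by leaning on the hard-wired $zero register:

    /* Places where MIPS code generation commonly uses $zero instead of
     * materializing the constant 0 (illustrative only). */

    int is_zero(int x)      /* compare directly against $zero:              */
    {                       /*   beq  $a0, $zero, .Ltrue                    */
        return x == 0;
    }

    int clear(void)         /* "return 0" is just a move from $zero:        */
    {                       /*   move $v0, $zero  (addu $v0, $zero, $zero)  */
        return 0;
    }

    int negate(int x)       /* unary minus becomes a subtraction from zero: */
    {                       /*   subu $v0, $zero, $a0                       */
        return -x;
    }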

Can the LSD issue uOPs from the next iteration of the detected loop?

空扰寡人 submitted on 2019-11-30 18:50:14
I was investigating the capabilities of the branch unit on port 0 of my Haswell, starting with a very simple loop: BITS 64 GLOBAL _start SECTION .text _start: mov ecx, 10000000 .loop: dec ecx ;| jz .end ;| 1 uOP (call it D) jmp .loop ;| 1 uOP (call it J) .end: mov eax, 60 xor edi, edi syscall Using perf, we see that the loop runs at 1c/iter: Performance counter stats for './main' (50 runs): 10,001,055 uops_executed_port_port_6 ( +- 0.00% ) 9,999,973 uops_executed_port_port_0 ( +- 0.00% ) 10,015,414 cycles:u ( +- 0.02% ) 23 resource_stalls_rs ( +- 64.05% ) My interpretations of these
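
For anyone who wants to reproduce this without a standalone NASM file, a rough C harness with GCC extended asm could look like the sketch below (my own, not from the post); build it with plain gcc and run it under perf stat with the same events quoted above:

    /* Rough equivalent of the NASM loop, as GCC extended asm. */
    int main(void)
    {
        unsigned n = 10000000;
        __asm__ volatile(
            "1:\n\t"
            "dec %0\n\t"   /* dec + jz macro-fuse into one uop (D) */
            "jz  2f\n\t"   /* taken only on the final iteration    */
            "jmp 1b\n\t"   /* one uop (J), back to the top         */
            "2:\n\t"
            : "+r"(n)
            :
            : "cc");
        return 0;
    }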

What is the benefit of the MOESI cache coherency protocol over MESI?

爱⌒轻易说出口 submitted on 2019-11-30 18:17:26
Question: I was wondering what benefits MOESI has over the MESI cache coherency protocol, and which protocol is currently favored for modern architectures. Oftentimes benefits don't translate to implementation if the costs don't allow it. Quantitative performance results of MOESI over MESI would be nice to see as well. Answer 1: AMD uses MOESI, Intel uses MESIF. (I don't know about non-x86 cache details.) MOESI allows sending dirty cache lines directly between caches instead of writing back to a shared outer

Is Intel QuickPath Interconnect (QPI) used by processors to access memory?

扶醉桌前 submitted on 2019-11-30 17:09:44
Question: I have read An Introduction to the Intel® QuickPath Interconnect. The document does not mention that QPI is used by processors to access memory, so I think that processors don't access memory through QPI. Is my understanding correct? Answer 1: Yes, QPI is used to access all remote memory on multi-socket systems, and much of its design and performance is intended to support such access in a reasonable fashion (i.e., with latency and bandwidth not too much worse than local access). Basically, most

Why does the number of uops per iteration increase with the stride of streaming loads?

点点圈 submitted on 2019-11-30 13:42:12
Consider the following loop: .loop: add rsi, OFFSET mov eax, dword [rsi] dec ebp jg .loop where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages, so we can easily experiment with very large buffers. The loop consists of 4 instructions. Each
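
A possible C driver for experimenting with this (my own sketch, not the questioner's code; OFFSET and ITERS are illustrative names) keeps the buffer in .bss and reuses the same add/mov/dec/jg pattern as inline asm:

    /* Build e.g. with: gcc -O2 -DOFFSET=128 stride.c
     * For very large OFFSET values the array exceeds 2 GiB, so either shrink
     * ITERS or build with -mcmodel=medium. Run under perf to compare
     * uops/iteration across strides. */
    #ifndef OFFSET
    #define OFFSET 64
    #endif

    #define ITERS (1u << 20)

    static char buf[(unsigned long long)OFFSET * ITERS + 64];  /* lives in .bss */

    int main(void)
    {
        const char *p = buf;
        unsigned n = ITERS;
        __asm__ volatile(
            "1:\n\t"
            "add %2, %0\n\t"        /* pointer += OFFSET       */
            "mov (%0), %%eax\n\t"   /* streaming 4-byte load   */
            "dec %1\n\t"
            "jg  1b\n\t"
            : "+r"(p), "+r"(n)
            : "i"(OFFSET)
            : "eax", "memory", "cc");
        return 0;
    }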

x86 32-bit opcodes that differ in x86-64 or were entirely removed

∥☆過路亽.° submitted on 2019-11-30 13:03:03
Question: I looked up x86 backward compatibility in x86-64 on Wikipedia, and it says: x86-64 is fully backwards compatible with 16-bit and 32-bit x86 code. Because the full x86 16-bit and 32-bit instruction sets remain implemented in hardware without any intervening emulation, existing x86 executables run with no compatibility or performance penalties, whereas existing applications that are recoded to take advantage of new features of the processor design may achieve performance improvements. So I've
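
One quick way to probe which 32-bit instructions were dropped from 64-bit mode is to feed them to the assembler; the sketch below is my own (the exact diagnostics vary, but GNU as rejects these when targeting x86-64 with messages along the lines of "not supported in 64-bit mode", while a -m32 build assembles them):

    /* Do not actually call this at run time; it exists only to probe the
     * assembler. "gcc -c removed.c" on an x86-64 target should fail,
     * "gcc -m32 -c removed.c" (with 32-bit support installed) should not. */
    void removed_in_long_mode(void)
    {
        __asm__ volatile("aaa");    /* ASCII adjust after addition - removed */
        __asm__ volatile("into");   /* interrupt on overflow       - removed */
        __asm__ volatile("pusha");  /* push all GPRs               - removed */
    }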

Do all 64-bit Intel architectures support SSSE3/SSE4.1/SSE4.2 instructions?

寵の児 submitted on 2019-11-30 11:43:27
I searched the web and the Intel software manuals, but I am unable to confirm whether all Intel 64 architectures support up to SSSE3, SSE4.1, SSE4.2, AVX, etc., so that I can rely on a minimum set of SIMD instructions in my program. Please help. Chuck Walbourn: An x64-native (AMD64 or Intel 64) processor is only mandated to support SSE and SSE2. SSE3 is supported by Intel Pentium 4 processors (“Prescott”), AMD Athlon 64 (“revision E”), AMD Phenom, and later processors. This means most, but not quite all, x64-capable CPUs should support SSE3. Supplemental SSE3 (SSSE3) is
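
Since only SSE and SSE2 are guaranteed by the x86-64 baseline, anything newer should be checked at run time. A small GCC/Clang-specific sketch of my own (not from the answer) using the __builtin_cpu_supports builtin:

    #include <stdio.h>

    /* Runtime feature check: print 1 if the running CPU reports the feature. */
    int main(void)
    {
        __builtin_cpu_init();  /* only required if run before constructors; harmless here */
        printf("SSE3   : %d\n", __builtin_cpu_supports("sse3"));
        printf("SSSE3  : %d\n", __builtin_cpu_supports("ssse3"));
        printf("SSE4.1 : %d\n", __builtin_cpu_supports("sse4.1"));
        printf("SSE4.2 : %d\n", __builtin_cpu_supports("sse4.2"));
        printf("AVX    : %d\n", __builtin_cpu_supports("avx"));
        return 0;
    }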