micro-architecture

How does the indexing of Ice Lake's 48 KiB L1 data cache work?

Submitted by 孤街浪徒 on 2020-01-24 04:27:05
Question: The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way set-associative L1 data cache for the Ice Lake microarchitecture; software-visible latency/bandwidth will vary depending on access patterns and other factors. This baffled me because: there are 96 sets (48 KiB / 64 / 8), which is not a power of two; and the set-index bits plus the byte-offset bits add up to more than 12 bits, which makes the cheap PIPT-as-VIPT trick unavailable for 4 KiB pages. All in…
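The arithmetic behind the puzzle can be checked directly (a sketch I am adding, not part of the original question; the cache geometry is as stated above):

```python
import math

# Cache geometry claimed by the Ice Lake optimization manual
cache_bytes = 48 * 1024   # 48 KiB L1d
line_bytes  = 64          # cache line size
ways        = 8           # associativity

sets = cache_bytes // (line_bytes * ways)
print(sets)                               # 96 -- not a power of two

offset_bits = int(math.log2(line_bytes))  # 6 bits select the byte within a line
index_bits  = math.ceil(math.log2(sets))  # 7 bits are needed to address 96 sets
print(offset_bits + index_bits)           # 13 > 12, so the set index spills past
                                          # the 4 KiB page-offset bits
```

This is exactly why the questioner notes the usual VIPT trick breaks: with 4 KiB pages only the low 12 bits are translation-invariant, but indexing 96 sets of 64-byte lines needs 13 bits.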

Why does jnz require 2 cycles to complete in an inner loop?

Submitted by 佐手、 on 2020-01-20 08:07:45
Question: I'm on an IvyBridge. I found the performance behavior of jnz inconsistent between an inner loop and an outer loop. The following simple program has an inner loop with a fixed size of 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax, 16
.loop_inner:
    dec rax
    jnz .loop_inner
    dec rcx
    jnz .loop_outer
    xor edi, edi
    mov eax, 60
    syscall

The perf tool shows the outer loop runs at 32c/iter, which suggests the jnz requires 2 cycles to complete. I then searched Agner's instruction tables; a conditional jump has 1-2 "reciprocal throughput", with a comment "fast if no jump". At this point I started to believe the above…
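A back-of-the-envelope check of the numbers reported in the question (my sketch, not from the original post; the macro-fusion assumption is mine):

```python
# Measurement reported above: 32 cycles per outer-loop iteration,
# 16 inner-loop iterations per outer iteration.
cycles_per_outer_iter = 32
inner_iters_per_outer = 16

# Each inner iteration is essentially one dec/jnz pair (which Intel cores
# can macro-fuse into a single uop), so the apparent cost per taken jnz is:
cycles_per_jnz = cycles_per_outer_iter / inner_iters_per_outer
print(cycles_per_jnz)   # 2.0 -- twice the 1/cycle best case in Agner's tables
```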

What is the minimal number of dependency chains to maximize the execution throughput?

Submitted by 怎甘沉沦 on 2019-12-25 09:48:09
Question: Given a chain of instructions linked by true dependencies and repeated periodically (i.e. a loop), for example (a->b->c)->(a->b->c)->..., assume it can be split into several shorter, independent sub-dependency chains to benefit from out-of-order execution: (a0->b0->c0)->(a0->b0->c0)->... and (a1->b1->c1)->(a1->b1->c1)->... The out-of-order engine schedules each instruction to the corresponding CPU unit, each of which has a latency and a reciprocal throughput. What is the optimal number of sub…
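A common rule of thumb for this question (my addition, not from the original post): to saturate an execution unit you need roughly latency / reciprocal-throughput independent chains in flight, i.e. the latency-bandwidth product. A minimal sketch:

```python
import math

def chains_to_saturate(latency_cycles: float, recip_throughput: float) -> int:
    """Minimum number of independent dependency chains needed so a new
    operation can start every recip_throughput cycles while each one
    takes latency_cycles to finish."""
    return math.ceil(latency_cycles / recip_throughput)

# Example: an operation with 4-cycle latency that can issue once per
# 0.5 cycles (two identical execution ports)
print(chains_to_saturate(4, 0.5))   # 8 independent chains
```

In practice a few extra chains beyond this bound help absorb scheduling jitter, and register pressure caps how many chains are worth unrolling.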

About the RIDL vulnerabilities and the “replaying” of loads

Submitted by 拜拜、爱过 on 2019-12-07 17:46:58
Question: I'm trying to understand the RIDL class of vulnerabilities: a class of vulnerabilities that can read stale data from various micro-architectural buffers. The vulnerabilities known today exploit the LFBs, the load ports, the eMC, and the store buffer. The linked paper focuses mainly on the LFBs. I don't understand why the CPU would satisfy a load with stale data from an LFB. I can imagine that if a load hits in L1d, it is internally "replayed" until the L1d brings data into an…

In x86 Intel VT-X non-root mode, can an interrupt be delivered at every instruction boundary?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-07 08:47:25
Question: Other than the normal specified conditions under which interrupts are not delivered to the virtual processor (cli, IF=0, etc.), are all instructions in the guest actually interruptible? That is, when an incoming hardware interrupt is given to the LAPIC and then to the processor, supposedly some internal magic translates it into a virtual interrupt for the guest (using the virtual APIC, with no exit). When that happens, does the currently executing instruction immediately serialize the OOO…

Why isn't there a data bus which is as wide as the cache line size?

Submitted by 半城伤御伤魂 on 2019-11-30 09:17:02
When a cache miss occurs, the CPU fetches a whole cache line (typically 64 bytes on x86_64) from main memory into the cache hierarchy. This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes). EDIT: "Data bus" here means the bus between the CPU die and the DRAM modules; its width does not necessarily correlate with the word size. Depending on the strategy, the actually requested address is fetched first, and then the rest of the cache line is fetched sequentially. It would seem much faster if there were a bus with…
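The transfer count implied by those numbers (a sketch I am adding, assuming the 64-byte line and 8-byte bus stated in the question):

```python
line_bytes = 64   # one x86_64 cache line
bus_bytes  = 8    # CPU-to-DRAM data bus width assumed in the question

transfers = line_bytes // bus_bytes
print(transfers)  # 8 -- matches a DDR burst length of 8 (BL8), so a single
                  # DRAM burst delivers a whole cache line back-to-back
```

This is part of why a wider bus buys less than it seems: the line already arrives as one pipelined burst rather than eight independent round trips.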

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

Submitted by 天大地大妈咪最大 on 2019-11-26 14:55:49
Question: First, I have the setup below on an IvyBridge; I will insert the measured payload code at the commented location. The first 8 bytes of buf store the address of buf itself; I use this to create a loop-carried dependency:

section .bss
align 64
buf: resb 64

section .text
global _start
_start:
    mov rcx, 1000000000
    mov qword [buf], buf
    mov rax, buf
loop:
    ; I will insert payload here
    ; as is described below
    dec rcx
    jne loop
    xor rdi, rdi
    mov rax, 60
    syscall

Case 1: I insert into the payload location: mov…