cpu-architecture

boost lockfree spsc_queue cache memory access

岁酱吖の submitted on 2019-12-19 08:44:07

Question: I need to be extremely concerned with speed/latency in my current multi-threaded project. Cache access is something I'm trying to understand better, and I'm not clear on how lock-free queues (such as boost::lockfree::spsc_queue) access/use memory at the cache level. I've seen queues used where the pointer to a large object that needs to be operated on by the consumer core is pushed into the queue. If the consumer core pops an element from the queue, I presume that means the element (a…
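The pointer-passing pattern the question describes can be sketched with a minimal single-producer/single-consumer ring. This is a simplified stand-in, not Boost's actual implementation; the 64-byte cache-line size is an assumption:

```cpp
#include <atomic>
#include <cstddef>

// Simplified SPSC ring holding raw pointers -- a sketch of the pattern the
// question describes, NOT boost::lockfree::spsc_queue itself.
template <typename T, size_t N>
class SpscPtrQueue {
    // head/tail live on separate (assumed 64-byte) cache lines so the
    // producer and consumer cores don't false-share the indices.
    alignas(64) std::atomic<size_t> head_{0};  // advanced by the consumer
    alignas(64) std::atomic<size_t> tail_{0};  // advanced by the producer
    T* slots_[N];

public:
    bool push(T* p) {                       // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false; // full
        slots_[t] = p;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T*& out) {                     // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;    // empty
        out = slots_[h];  // only the pointer crosses the queue; dereferencing
                          // *out on the consumer core still misses cache until
                          // the object's lines migrate from the producer core
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Note that the queue itself only moves 8 bytes per element; the expensive part is the first touch of the pointed-to object on the consumer side.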

The inner workings of Spectre (v2)

谁说胖子不能爱 submitted on 2019-12-19 08:12:35

Question: I have done some reading about Spectre v2 and obviously you get the non-technical explanations. Peter Cordes has a more in-depth explanation, but it doesn't fully address a few details. Note: I have never performed a Spectre v2 attack, so I do not have hands-on experience; I have only read up about the theory. My understanding of Spectre v2 is that you make an indirect branch mispredict, for instance if (input < data.size). If the Indirect Target Array (which I'm not too sure of the…

Why is %r0 of SPARC (and MIPS) always 0?

半世苍凉 submitted on 2019-12-19 04:01:16

Question: I know that reading %r0 on a SPARC CPU (and on MIPS) always returns 0, but I would like to know why. What design decision is behind this, and why? Answer 1: It's just the way the CPU was designed. Ensuring that r0 is always set to zero is, for one thing, a way to avoid potentially costly memory accesses for a very common value. On one hand (reading), it's handy to have a register set aside to contain the value zero so that you can use it. Otherwise, you would have to load zero into a register…
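The hardwired-zero behaviour is easy to model with a toy register file (hypothetical, purely for illustration): reads of r0 return 0 and writes to it are silently dropped, which is what makes MIPS-style move/clear pseudo-ops free:

```cpp
#include <array>
#include <cstdint>

// Toy register file illustrating a hardwired zero register (like MIPS $zero
// or SPARC %g0): reads of r0 always return 0, writes to r0 are discarded.
struct RegFile {
    std::array<uint32_t, 32> r{};
    uint32_t read(int i) const { return i == 0 ? 0 : r[i]; }
    void write(int i, uint32_t v) { if (i != 0) r[i] = v; }
};

// With r0 available, common idioms need no immediate load or extra opcode:
//   move rd, rs   ==  add rd, r0, rs
//   clear rd      ==  add rd, r0, r0
uint32_t move_via_zero(RegFile& rf, int rd, int rs) {
    rf.write(rd, rf.read(0) + rf.read(rs));  // "add rd, r0, rs"
    return rf.read(rd);
}
```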

Why does the number of uops per iteration increase with the stride of streaming loads?

扶醉桌前 submitted on 2019-12-18 15:35:13

Question: Consider the following loop:

.loop:
    add rsi, OFFSET
    mov eax, dword [rsi]
    dec ebp
    jg .loop

where OFFSET is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not being initialized or touched before the loop. Presumably, on Linux, all of the 4K virtual pages of the buffer will be mapped on-demand to the same physical page. Therefore, the only limit on the buffer size is the number of virtual pages.
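One way to see why larger OFFSET values stress the paging hardware is to count how many distinct 4 KiB pages a strided walk touches. This toy model (buffer size and iteration count are arbitrary choices) is only an illustration of the access pattern, not a uop-level analysis:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Counts how many distinct 4 KiB pages a strided walk over a buffer touches.
// A rough model of why larger strides hit more pages per iteration and thus
// stress the TLB/page-walk hardware in the question's loop.
size_t pages_touched(size_t buf_bytes, size_t stride, size_t iters) {
    std::set<uint64_t> pages;
    uint64_t addr = 0;
    for (size_t i = 0; i < iters; ++i) {
        pages.insert(addr / 4096);        // page number of this load
        addr = (addr + stride) % buf_bytes;
    }
    return pages.size();
}
```

With a 4-byte stride, 1024 loads stay inside one page; with a 4096-byte stride, every load lands on a new page.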

How to determine binary image architecture at runtime?

我是研究僧i submitted on 2019-12-18 11:40:18

Question: A crash log contains a "Binary Images" section with information about the architecture (armv6/armv7) and identifier of all loaded modules. How can this information be determined at runtime? (At least for the application executable.) NSBundle has the method executableArchitectures, but how do you determine which architecture is actually running? Answer 1: Alright, time for the long answer. The Mach headers of the dyld images in the application contain the information you are looking for. I have added an example that I only…
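The header check the answer alludes to can be sketched portably by parsing the magic and cputype fields from raw bytes. The constants below mirror the values in <mach-o/loader.h> and <mach/machine.h>; on a real iOS/macOS process you would inspect the header returned by _dyld_get_image_header(0) rather than a synthetic buffer:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Sketch of inspecting a Mach-O header's magic and cputype fields from raw
// bytes. Constants mirror <mach-o/loader.h> / <mach/machine.h>.
constexpr uint32_t kMagic32 = 0xfeedface;  // MH_MAGIC (32-bit image)
constexpr uint32_t kMagic64 = 0xfeedfacf;  // MH_MAGIC_64 (64-bit image)
constexpr int32_t  kCpuArm  = 12;          // CPU_TYPE_ARM

std::string describe(const uint8_t* hdr) {
    uint32_t magic;
    int32_t cputype;
    std::memcpy(&magic, hdr, 4);           // first field: magic
    std::memcpy(&cputype, hdr + 4, 4);     // second field: cputype
    if (magic != kMagic32 && magic != kMagic64) return "not mach-o";
    return cputype == kCpuArm ? "arm" : "other";
}
```

Distinguishing armv6 from armv7 additionally requires the cpusubtype field, which follows cputype in the header.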

Does a thread waiting on IO also block a core?

社会主义新天地 submitted on 2019-12-18 10:44:35

Question: In the synchronous/blocking model of computation we usually say that a thread of execution will wait (be blocked) while it waits for an IO task to complete. My question is simply: will this usually cause the CPU core executing the thread to be idle, or will a thread waiting on IO usually be context-switched out and put into a waiting state until the IO is ready to be processed? Answer 1: A CPU core is normally not dedicated to one particular thread of execution. The kernel is constantly switching…
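The context-switch answer can be illustrated with a condition variable standing in for the IO wait (a sketch, not real IO): the waiting thread sleeps inside the kernel and is switched off the core, which remains free to run other work:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The condition-variable wait below stands in for a blocking IO call: the
// waiter is descheduled (no CPU consumed) until another thread signals that
// the "IO" is ready, mirroring how the kernel parks a thread blocked on IO.
struct Waiter {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int result = 0;

    void wait_for_io() {                        // "blocked on IO"
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return ready; });  // sleeps, core stays free
        result = 42;                            // runs after the wakeup
    }
    void io_complete() {                        // "IO ready": wake the thread
        { std::lock_guard<std::mutex> lk(m); ready = true; }
        cv.notify_one();
    }
};
```

While one thread sits in wait_for_io, the launching thread can keep computing on the same core; the scheduler only runs the waiter again after io_complete.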

Address translation with multiple pagesize-specific TLBs

我的未来我决定 submitted on 2019-12-18 09:18:13

Question: For Intel 64 and IA-32 processors, for both data and code independently, there may be both a 4KB TLB and a large-page (2MB, 1GB) TLB (LTLB). How does address translation work in this case? Would the hardware simply be able to access both in parallel, knowing that a double hit can't occur? In the LTLBs, how would the entries be organized? I suppose, when the entry is originally filled from a page-structure entry, the LTLB entry could include information about how a hit on this entry would…
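The "no double hit" observation can be modeled with two toy lookup tables probed with the same virtual address, each indexed at its own page granularity. Since a given page is mapped at exactly one size, at most one table can hit (purely illustrative; real TLBs are set-associative hardware structures):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model of probing a 4 KiB TLB and a 2 MiB TLB "in parallel": both are
// indexed with the same virtual address, each by its own page-size shift.
// A page is mapped at exactly one size, so at most one probe can hit.
struct Tlbs {
    std::unordered_map<uint64_t, uint64_t> small;  // 4K virtual page -> frame
    std::unordered_map<uint64_t, uint64_t> large;  // 2M virtual page -> frame

    std::optional<uint64_t> translate(uint64_t va) const {
        auto s = small.find(va >> 12);   // 4 KiB probe
        auto l = large.find(va >> 21);   // 2 MiB probe (conceptually parallel)
        if (s != small.end()) return (s->second << 12) | (va & 0xfff);
        if (l != large.end()) return (l->second << 21) | (va & 0x1fffff);
        return std::nullopt;             // miss in both -> page walk
    }
};
```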

How can I perform 64-bit division with a 32-bit divide instruction?

故事扮演 submitted on 2019-12-18 04:45:18

Question: This is (AFAIK) a specific question within this general topic. Here's the situation: I have an embedded system (a video game console) based on a 32-bit RISC microcontroller (a variant of NEC's V810). I want to write a fixed-point math library. I read this article, but the accompanying source code is written in 386 assembly, so it's neither directly usable nor easily modifiable. The V810 has built-in integer multiply/divide, but I want to use the 18.14 format mentioned in the above article.
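A baseline for the 64-by-32 divide is schoolbook shift-and-subtract, which needs only 32-bit-friendly operations (shifts, compares, subtracts). This is a generic sketch, not the V810-tuned routine the linked article derives:

```cpp
#include <cstdint>

// Schoolbook shift-and-subtract division: a 64-bit dividend and a 32-bit
// divisor, one quotient bit per step -- the kind of routine you hand-roll
// when the hardware divide is 32-bit only. The partial remainder is kept
// 64-bit here so the compare can't overflow (hardware would track a carry
// bit instead). Assumes d != 0.
uint64_t udiv64_32(uint64_t n, uint32_t d, uint32_t* rem_out) {
    uint64_t q = 0, rem = 0;
    for (int i = 63; i >= 0; --i) {
        rem = (rem << 1) | ((n >> i) & 1);  // bring down next dividend bit
        if (rem >= d) {                     // at most one subtraction per bit
            rem -= d;
            q |= 1ull << i;                 // set this quotient bit
        }
    }
    if (rem_out) *rem_out = static_cast<uint32_t>(rem);
    return q;
}
```

For an 18.14 fixed-point divide a/b, one would shift the dividend left by 14 first (a 64-bit intermediate) and then apply this routine.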

Assembly: Why are we bothering with registers?

丶灬走出姿态 submitted on 2019-12-18 04:35:33

Question: I have a basic question about assembly. Why do we bother doing arithmetic operations only on registers if they can work on memory as well? For example, both of the following cause (essentially) the same value to be calculated as an answer:

Snippet 1

.data
var dd 00000400h
.code
Start:
    add var, 0000000Bh
    mov eax, var        ; breakpoint: var = 0000040B
End Start

Snippet 2

.code
Start:
    mov eax, 00000400h
    add eax, 0000000bh  ; breakpoint: eax = 0000040B
End Start

From what I can see most texts and tutorials do…