cpu-architecture

CPU and Data alignment

别来无恙 submitted on 2019-11-26 12:27:52
Question: Pardon me if you feel this has been answered numerous times, but I need answers to the following queries! Why does data have to be aligned (on 2-byte / 4-byte / 8-byte boundaries)? My doubt is this: when the CPU has address lines Ax, Ax-1, Ax-2, ..., A2, A1, A0, it is quite possible to address memory locations sequentially. So why is there a need to align data at specific boundaries? And how do I find the alignment requirements when compiling my code and generating the executable? If for e.g

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

a 夏天 submitted on 2019-11-26 12:03:23
I'm confused about how many FLOPs per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, with SSE it should be 4 FLOPs per cycle per core, and with AVX/AVX2 it should be 8 FLOPs per cycle per core. This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here, Sandy-Bridge CPU specification. However, the link below seems to indicate that Sandy Bridge can do 16 FLOPs per cycle per core and Haswell 32 FLOPs per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd . Can someone

Which cache mapping technique is used in intel core i7 processor?

一个人想着一个人 submitted on 2019-11-26 11:23:34
I have learned about the different cache mapping techniques, like direct mapping, associative mapping, and set-associative mapping, and have also learned their trade-offs. But I am curious what is used in Intel Core i7 or AMD processors nowadays, how the techniques have evolved, and what still needs to be improved. Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate for a set-associative cache of the same size, with only a bit more complexity in the control logic. Transistor budgets are very

Why is x86 little endian?

假如想象 submitted on 2019-11-26 10:28:49
Question: A real question that I've been asking myself lately is what design choices brought about x86 being a little-endian architecture instead of a big-endian architecture? Answer 1: Largely for the same reason you start at the least significant digit (the right end) when you add: because carries propagate toward the more significant digits. Putting the least significant byte first allows the processor to get started on the add after having read only the first byte of an offset. After you've done enough

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

≯℡__Kan透↙ submitted on 2019-11-26 10:28:36
Why is the size of the L1 cache smaller than that of the L2 cache in most processors? There are different reasons for that. L2 exists in the system to speed up the case where there is an L1 cache miss. If L1 were the same size as or bigger than L2, then L2 could not accommodate more cache lines than L1 and would not be able to deal with L1 cache misses. From the design/cost perspective, the L1 cache is bound to the processor and faster than L2. The whole idea of caches is that you speed up access to slower hardware by adding intermediate hardware that is faster

Globally Invisible load instructions

我的梦境 submitted on 2019-11-26 10:00:15
Question: Can some load instructions never be globally visible due to store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from the cache. Since it is generally stated that a load is globally visible when it reads from the L1D cache, the ones that do not read from the L1D should be globally invisible. Answer 1: The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and

Slow jmp-instruction

牧云@^-^@ submitted on 2019-11-26 09:59:30
Question: As a follow-up to my question The advantages of using 32bit registers/instructions in x86-64, I started to measure the costs of instructions. I'm aware that this has been done multiple times (e.g. by Agner Fog), but I'm doing it for fun and self-education. My testing code is pretty simple (shown here as pseudo code for simplicity; in reality it is in assembler):

    for (outer_loop = 0; outer_loop < NO; outer_loop++) {
        operation  # first
        operation  # second
        ...
        operation  # NI-th
    }

Yet some things should be considered.

On 32-bit CPUs, is an 'integer' type more efficient than a 'short' type?

╄→гoц情女王★ submitted on 2019-11-26 09:52:58
Question: On a 32-bit CPU, an integer is 4 bytes and a short integer is 2 bytes. If I am writing a C/C++ application that uses many numeric values that will always fit within the range of a short integer, is it more efficient to use 4-byte integers or 2-byte integers? I have heard it suggested that 4-byte integers are more efficient, as this fits the bandwidth of the bus from memory to the CPU. However, if I am adding together two short integers, would the CPU package both values in a single

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

。_饼干妹妹 submitted on 2019-11-26 08:52:04
We've got a simple memory throughput benchmark. All it does is memcpy repeatedly on a large block of memory. Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping the OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel. Any ideas why this might be happening? The code that follows is compiled in Release in VS2015,

Why do x86-64 systems have only a 48 bit virtual address space?

时光毁灭记忆、已成空白 submitted on 2019-11-26 08:48:45
Question: In a book I read the following: 32-bit processors have 2^32 possible addresses, while current 64-bit processors have a 48-bit address space. My expectation was that if it's a 64-bit processor, the address space should also be 2^64. So I was wondering, what is the reason for this limitation? Answer 1: Because that's all that's needed. 48 bits give you an address space of 256 terabytes. That's a lot. You're not going to see a system which needs more than that any time soon. So CPU manufacturers took a