cpu-architecture

Detecting CPU architecture at compile time

Submitted by 核能气质少年 on 2019-11-26 04:38:52
Question: What is the most reliable way to find out the CPU architecture when compiling C or C++ code? As far as I can tell, different compilers have their own sets of non-standard preprocessor definitions (_M_IX86 in MSVS, __i386__ and __arm__ in GCC, etc.). Is there a standard way to detect the architecture I'm building for? If not, is there a source for a comprehensive list of such definitions for the various compilers, such as a header with all the boilerplate #ifdefs? Answer 1: Here is some information about Pre…
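A minimal sketch of the usual #ifdef boilerplate, assuming the common predefined macros of MSVC, GCC, and Clang (this is not an exhaustive list, and the MY_ARCH macro name is made up for illustration):

#include <cstdio>

// Map each compiler's non-standard architecture macros onto one string.
#if defined(_M_X64) || defined(__x86_64__)
#  define MY_ARCH "x86-64"
#elif defined(_M_IX86) || defined(__i386__)
#  define MY_ARCH "x86 (32-bit)"
#elif defined(_M_ARM64) || defined(__aarch64__)
#  define MY_ARCH "ARM64"
#elif defined(_M_ARM) || defined(__arm__)
#  define MY_ARCH "ARM (32-bit)"
#else
#  define MY_ARCH "unknown"
#endif

int main() { std::printf("compiled for: %s\n", MY_ARCH); }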

FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2

Submitted by 安稳与你 on 2019-11-26 02:42:04
Question: I'm confused about how many flops per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, it should be 4 flops per cycle per core for SSE and 8 flops per cycle per core for AVX/AVX2. This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here, Sandy-Bridge CPU specification. However, the link below seems to indicate that Sandy Bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core: http://www…
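A back-of-the-envelope sketch of where the 16 and 32 figures come from, assuming 256-bit AVX vectors (8 single-precision lanes) and the port layout usually cited: Sandy Bridge can issue one vector add and one vector multiply per cycle, while Haswell can issue two FMAs per cycle, each FMA counting as a multiply plus an add:

#include <cstdio>

int main() {
    const int floats_per_vec = 256 / 32;                       // AVX: 8 single-precision lanes
    const int snb = floats_per_vec * (1 + 1);                  // 1 add port + 1 mul port -> 16 flops/cycle
    const int hsw = floats_per_vec * 2 /*FMA units*/ * 2 /*mul+add*/;  // -> 32 flops/cycle
    std::printf("Sandy Bridge: %d SP flops/cycle/core\n", snb);
    std::printf("Haswell:      %d SP flops/cycle/core\n", hsw);
}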

Which cache mapping technique is used in the Intel Core i7 processor?

Submitted by 坚强是说给别人听的谎言 on 2019-11-26 02:14:00
Question: I have learned about the different cache mapping techniques, such as direct mapping, associative mapping, and set-associative mapping, and also learned the trade-offs. But I am curious what is used in Intel Core i7 or AMD processors nowadays, how the techniques have evolved, and what needs to be improved. Answer 1: Direct-mapped caches are basically never used in modern high-performance CPUs. The power savings are outweighed by the large advantage in hit rate for a set-associative…
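As a sketch of how a set-associative cache divides an address, here is the split into offset, set index, and tag for the commonly cited L1d parameters (32 KiB, 8-way, 64-byte lines, so 32768 / (8 * 64) = 64 sets); the example address is arbitrary:

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t line_bytes = 64, ways = 8, size_bytes = 32 * 1024;
    const uint64_t sets = size_bytes / (ways * line_bytes);     // 64 sets -> 6 index bits
    uint64_t addr   = 0x7ffe12345678ULL;                        // example address
    uint64_t offset = addr % line_bytes;                        // byte within the line
    uint64_t set    = (addr / line_bytes) % sets;               // which set to look in
    uint64_t tag    = addr / (line_bytes * sets);               // compared against all 8 ways
    std::printf("offset=%llu set=%llu tag=%#llx\n",
                (unsigned long long)offset, (unsigned long long)set,
                (unsigned long long)tag);
}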

Why is the size of the L1 cache smaller than that of the L2 cache in most processors?

Submitted by 瘦欲@ on 2019-11-26 02:09:00
Question: Why is the size of the L1 cache smaller than that of the L2 cache in most processors? Answer 1: There are different reasons for that. L2 exists in the system to speed up the case where there is an L1 cache miss. If L1 were the same size as or bigger than L2, then L2 could not accommodate more cache lines than L1 and would not be able to deal with L1 cache misses. From the design/cost perspective, the L1 cache is bound to the processor and faster than L2. The whole idea of caches…
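A sketch of the average-memory-access-time argument behind a small, fast L1 backed by a larger L2. The latencies and miss rates below are illustrative assumptions, not measurements of any specific CPU:

#include <cstdio>

int main() {
    double l1_hit = 4, l2_hit = 12, mem = 200;   // access latencies in cycles (assumed)
    double l1_miss = 0.05, l2_miss = 0.20;       // miss rates (assumed)
    // AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory latency)
    double amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem);
    std::printf("AMAT ~= %.2f cycles\n", amat);  // ~6.6 cycles with these numbers
}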

Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

Submitted by ぃ、小莉子 on 2019-11-26 01:54:31
Question: We've got a simple memory throughput benchmark. All it does is memcpy repeatedly on a large block of memory. Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping the OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad…
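A minimal sketch of the kind of benchmark described: memcpy a large block repeatedly and report GB/s. The buffer size and repeat count here are assumptions, not the question's actual parameters:

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t bytes = 512ull * 1024 * 1024;   // 512 MiB, larger than any cache level
    const int reps = 20;
    std::vector<char> src(bytes, 1), dst(bytes, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // Count read + write traffic: 2 * bytes per memcpy.
    std::printf("throughput: %.2f GB/s\n", 2.0 * bytes * reps / secs / 1e9);
}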

What Every Programmer Should Know About Memory?

Submitted by 前提是你 on 2019-11-26 00:26:59
Question: I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also, I could not find a version newer than 1.0 or an errata. Answer 1: As far as I remember, Drepper's content describes fundamental concepts about memory: how the CPU cache works, what physical and virtual memory are, and how the Linux kernel deals with that zoo. There are probably outdated API references in some examples, but it doesn't matter; that won't affect the relevance of the fundamental…

Are loads and stores the only instructions that get reordered?

Submitted by 点点圈 on 2019-11-26 00:17:06
Question: I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores. Does a CPU (I'm specifically interested in an x86 CPU) only reorder loads and stores, and not reorder the rest of the instructions it has? Answer 1: Out-of-order execution preserves the illusion of running in program order for a single thread/core. This is like the C/C++ as-if optimization rule: do whatever you want internally as long as the visible effects are the same. Separate…
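A sketch of the classic StoreLoad litmus test that makes the reordering visible across threads, assuming C++11 <atomic> and <thread>. With relaxed ordering (or plain x86 stores and loads), each core may execute its load before its own earlier store becomes globally visible, so both r1 and r2 can end up 0:

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
void t2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

int main() {
    for (int i = 0; i < 100000; ++i) {
        x = 0; y = 0;
        std::thread a(t1), b(t2);
        a.join(); b.join();
        if (r1 == 0 && r2 == 0) {           // only possible if a store was reordered past a load
            std::printf("reordered at iteration %d\n", i);
            break;
        }
    }
}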

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Submitted by 女生的网名这么多〃 on 2019-11-25 22:59:08
Question: This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.
; synthetic micro-benchmark to test partial-register renaming
    mov ecx, 1000000000
.loop:                  ; do {
    imul eax, eax       ; a dep chain with high latency but also high throughput
    imul eax, eax
    imul eax, eax
    dec ecx             ; set ZF, independent of old ZF. (Use sub ecx,1 on…

Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?

Submitted by 独自空忆成欢 on 2019-11-25 22:16:15
Question: LOOP (Intel ref manual entry) decrements ecx/rcx and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz already macro-fuses into a single uop on Sandybridge-family; the only difference is that dec sets flags. loop on various microarchitectures, from Agner Fog's instruction tables:
K8/K10: 7 m-ops
Bulldozer-family/Ryzen: 1 m-op (same cost as a macro-fused test-and-branch, or jecxz)
P4: 4 uops (same as jecxz)
P6 (PII/PIII): 8 uops
Pentium M,…
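A hedged sketch of how one might compare the two forms directly, assuming GCC or Clang on x86-64 and using GNU extended inline asm; the iteration count mirrors the 1e9 used above, and the function names are made up for illustration:

#include <chrono>
#include <cstdio>

// Empty countdown using the loop instruction (rcx is the implicit counter).
static void spin_loop_insn(long n) {
    asm volatile("1: loop 1b" : "+c"(n) :: "memory");
}

// Equivalent countdown using dec/jnz, which macro-fuses on Sandybridge-family.
static void spin_dec_jnz(long n) {
    asm volatile("1: dec %0\n\t jnz 1b" : "+r"(n) :: "memory");
}

template <typename F>
static double time_it(F f, long n) {
    auto t0 = std::chrono::steady_clock::now();
    f(n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const long n = 1000000000L;
    std::printf("loop    : %.3f s\n", time_it(spin_loop_insn, n));
    std::printf("dec/jnz : %.3f s\n", time_it(spin_dec_jnz, n));
}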

What is the purpose of the “Prefer 32-bit” setting in Visual Studio and how does it actually work?

Submitted by ぃ、小莉子 on 2019-11-25 20:57:23
Question: It is unclear to me how the compiler will automatically know to compile for 64-bit when it needs to. How does it know when it can confidently target 32-bit? I am mainly curious about how the compiler knows which architecture to target when compiling. Does it analyze the code and make a decision based on what it finds? Answer 1 (Lex Li): Microsoft has a blog entry, What AnyCPU Really Means As Of .NET 4.5 and Visual Studio 11: In .NET 4.5 and Visual Studio 11 the cheese has been moved. The default for most .NET projects is again AnyCPU, but there is more than one meaning to AnyCPU now. There is an…