cpu-architecture

Why do x86-64 systems have only a 48-bit virtual address space?

走远了吗. Submitted on 2019-11-26 23:55:16
In a book I read the following: "32-bit processors have 2^32 possible addresses, while current 64-bit processors have a 48-bit address space." My expectation was that if it's a 64-bit processor, the address space should also be 2^64. So I was wondering what the reason for this limitation is. Answer: Because that's all that's needed. 48 bits give you an address space of 256 terabytes. That's a lot; you're not going to see a system which needs more than that any time soon. So CPU manufacturers took a shortcut: they use an instruction set which allows a full 64-bit address space, but current CPUs just only…
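
As a quick sanity check on the arithmetic, here is a small C program (my own illustration, not part of the original answer) that prints the sizes of the 32- and 48-bit address spaces:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* An n-bit address space spans 2^n bytes. */
        uint64_t space32 = 1ULL << 32;   /* 4 GiB   */
        uint64_t space48 = 1ULL << 48;   /* 256 TiB */

        printf("32-bit: %llu bytes (%llu GiB)\n",
               (unsigned long long)space32,
               (unsigned long long)(space32 >> 30));
        printf("48-bit: %llu bytes (%llu TiB)\n",
               (unsigned long long)space48,
               (unsigned long long)(space48 >> 40));
        /* 2^64 bytes (16 EiB) overflows a uint64_t, so it is not printed. */
        return 0;
    }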

How does 32-bit addressing give 4GB if 2³² bits = 4 billion bits, not bytes?

自古美人都是妖i Submitted on 2019-11-26 23:24:39
Question: Essentially, how does 4Gb turn into 4GB? If the memory is addressing bytes, shouldn't the possibilities be 2^(32/8)? Answer 1: It depends on how you address the data. If you use 32 bits to address each bit, you can address 2^32 bits = 4Gb = 512MB. If you address bytes, as most current architectures do, that gives you 4GB. But if you address much larger blocks, you need fewer bits to address 4GB. For example, if you address each 512-byte block (2^9 bytes), you can address 4GB with 23 bits.
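
The block-size trade-off is easy to verify; below is a minimal C sketch of mine (the helper name bits_needed is my own, hypothetical) computing how many address bits a given addressing granularity requires for 4 GiB:

    #include <stdio.h>
    #include <stdint.h>

    /* Number of address bits needed to index total_bytes/unit_bytes units. */
    static unsigned bits_needed(uint64_t total_bytes, uint64_t unit_bytes) {
        uint64_t units = total_bytes / unit_bytes;
        unsigned bits = 0;
        while ((1ULL << bits) < units)
            bits++;
        return bits;
    }

    int main(void) {
        uint64_t four_gib = 1ULL << 32;
        printf("byte-addressed:  %u bits\n", bits_needed(four_gib, 1));   /* 32 */
        printf("512-byte blocks: %u bits\n", bits_needed(four_gib, 512)); /* 23 */
        return 0;
    }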

How to interpret perf iTLB-loads, iTLB-load-misses

China☆狼群 Submitted on 2019-11-26 23:18:29
Question: I have a test case to observe perf iTLB-loads, iTLB-load-misses via

    perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses -p 22479

and get the output:

    Performance counter stats for process id '22479':

       1,262,817   dTLB-loads
          13,950   dTLB-load-misses    #    1.10% of all dTLB cache hits
              75   iTLB-loads
           6,882   iTLB-load-misses    # 9176.00% of all iTLB cache hits

       3.999720948 seconds time elapsed

I have no idea how to interpret only 75 iTLB-loads but 6,882 iTLB-load-misses?! lscpu shows: Intel…

Is LFENCE serializing on AMD processors?

江枫思渺然 Submitted on 2019-11-26 23:16:25
In recent Intel ISA documents, the lfence instruction has been defined as serializing the instruction stream (preventing out-of-order execution across it). In particular, the description of the instruction includes this line: "Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes." Note that this applies to all instructions, not just memory load instructions, making lfence more than just a memory-ordering fence. Although this now appears in the ISA documentation, it isn't clear if it is…
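
One common place this semantics matters is timing with rdtsc. A minimal sketch (my own, assuming GCC/Clang with x86intrin.h; not code from the question) that uses lfence to keep the timed region from reordering around the timestamp reads:

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    int main(void) {
        volatile int x = 0;

        _mm_lfence();              /* no later instruction starts before this completes */
        uint64_t t0 = __rdtsc();
        _mm_lfence();              /* keep the timed work after the first rdtsc */

        for (int i = 0; i < 1000; i++)   /* region being timed */
            x += i;

        _mm_lfence();              /* wait for the work to complete locally */
        uint64_t t1 = __rdtsc();
        _mm_lfence();

        printf("cycles: %llu\n", (unsigned long long)(t1 - t0));
        return 0;
    }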

API call to get processor architecture

天大地大妈咪最大 Submitted on 2019-11-26 22:45:01
Question: As part of my app I'm using the NDK, and I was wondering if it's worth bundling x86 and MIPS binaries alongside the standard ARM binaries. I figured the best way would be to track what my users actually have; is there an API call to grab the processor architecture so I can pass this back to my Google Analytics instance? Thanks. Answer 1: Actually, you can get the architecture without any need for reflection: String arch = System.getProperty("os.arch"); From my tests it returned armv71 and i686.
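
On the native (NDK) side, the architecture can also be detected at compile time via predefined compiler macros; here is a minimal C sketch of mine (the function name abi_name is hypothetical; the macros are the standard GCC/Clang ones):

    #include <stdio.h>

    /* Predefined compiler macros distinguish the Android ABIs. */
    static const char *abi_name(void) {
    #if defined(__aarch64__)
        return "arm64-v8a";
    #elif defined(__arm__)
        return "armeabi-v7a";
    #elif defined(__x86_64__)
        return "x86_64";
    #elif defined(__i386__)
        return "x86";
    #elif defined(__mips__)
        return "mips";
    #else
        return "unknown";
    #endif
    }

    int main(void) {
        printf("built for ABI: %s\n", abi_name());
        return 0;
    }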

Why does Intel hide the internal RISC core in its processors?

好久不见. Submitted on 2019-11-26 22:29:37
Question: Starting with the Pentium Pro (P6 microarchitecture), Intel redesigned its microprocessors to use an internal RISC core underneath the old CISC instructions. Since the Pentium Pro, all CISC instructions are divided into smaller parts (uops) and then executed by the RISC core. At the beginning it was clear to me that Intel decided to hide the new internal architecture and force programmers to use the "CISC shell". Thanks to this decision, Intel could fully redesign the microprocessor architecture without breaking…

Memory latency measurement with time stamp counter

前提是你 Submitted on 2019-11-26 22:25:51
Question: I have written the following code, which first flushes two array elements and then tries to read the elements in order to measure the hit/miss latencies.

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>
    #include <time.h>

    int main() {
        /* create array */
        int array[ 100 ];
        int i;
        for ( i = 0; i < 100; i++ )
            array[ i ] = i;        // bring array to the cache

        uint64_t t1, t2, ov, diff1, diff2, diff3;

        /* flush the first cache line */
        _mm_lfence();
        _mm_clflush( &array[ 30 ] );
        _mm_clflush( &array[ …
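
For reference, here is a self-contained sketch of the same measurement idea (my own, assuming an x86-64 compiler with x86intrin.h; the index 30 mirrors the question, but the use of rdtscp is my choice, not from the truncated original):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    int main(void) {
        static int array[100];
        for (int i = 0; i < 100; i++)
            array[i] = i;                 /* warm the array into the cache */

        unsigned aux;

        /* Evict one element, then time a load of it (expected: miss). */
        _mm_mfence();
        _mm_clflush(&array[30]);
        _mm_mfence();

        uint64_t t1 = __rdtscp(&aux);     /* rdtscp waits for prior loads */
        volatile int v = array[30];
        uint64_t t2 = __rdtscp(&aux);
        printf("flushed load: %llu cycles (value %d)\n",
               (unsigned long long)(t2 - t1), v);

        /* Time the same load again (expected: hit). */
        t1 = __rdtscp(&aux);
        v = array[30];
        t2 = __rdtscp(&aux);
        printf("cached load:  %llu cycles (value %d)\n",
               (unsigned long long)(t2 - t1), v);
        return 0;
    }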

What happens after a L2 TLB miss?

江枫思渺然 Submitted on 2019-11-26 22:16:34
I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer (TLB) result in misses. I am unsure whether "page walking" occurs in special hardware circuitry, whether the page tables are stored in the L2/L3 cache, or whether they reside only in main memory. Answer: Modern x86 microarchitectures have dedicated page-walk hardware. They can even do page walks speculatively, loading TLB entries before a TLB miss actually happens. Skylake can even have two page walks in flight at once; see Section 2.1.3 of Intel's optimization manual. This may be related to the page…
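
To make the walk concrete, here is a small C sketch (my own illustration, with an arbitrary example address) that splits a 48-bit x86-64 virtual address into the four 9-bit page-table indices and the 12-bit page offset, which is exactly the decomposition the page-walk hardware uses with 4 KiB pages:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t vaddr = 0x00007f1234567ABCULL;   /* example canonical address */

        /* 4 KiB pages: 12 offset bits, then four 9-bit indices. */
        uint64_t offset = vaddr & 0xFFF;
        uint64_t pt     = (vaddr >> 12) & 0x1FF;  /* level 1: page table     */
        uint64_t pd     = (vaddr >> 21) & 0x1FF;  /* level 2: page directory */
        uint64_t pdpt   = (vaddr >> 30) & 0x1FF;  /* level 3: PDPT           */
        uint64_t pml4   = (vaddr >> 39) & 0x1FF;  /* level 4: PML4           */

        printf("PML4=%llu PDPT=%llu PD=%llu PT=%llu offset=0x%llx\n",
               (unsigned long long)pml4, (unsigned long long)pdpt,
               (unsigned long long)pd, (unsigned long long)pt,
               (unsigned long long)offset);
        return 0;
    }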

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

扶醉桌前 Submitted on 2019-11-26 21:56:21
I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how best to do this in code, and I also want to know how it's done internally in the CPU, i.e. with the super-scalar architecture. Let's say I want to do a long sum such as the following in SSE:

    //sum = a1*b1 + a2*b2 + a3*b3 + ...
    //where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
    sum = _mm_set1_ps(0.0f);
    a1  = _mm_set1_ps(a[0]);
    b1  = _mm_load_ps(&b[0]);
    sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));
    a2  = _mm…
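
For comparison, the same accumulation written with the FMA intrinsic; this is a sketch of mine (assuming GCC/Clang, an FMA3-capable CPU, and compilation with -mfma; the array names mirror the question):

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        float a[2] = { 2.0f, 3.0f };
        float b[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

        __m128 sum = _mm_setzero_ps();

        /* sum += a[0] * b[0..3], multiply and add fused in one instruction */
        sum = _mm_fmadd_ps(_mm_set1_ps(a[0]), _mm_loadu_ps(&b[0]), sum);
        /* sum += a[1] * b[4..7] */
        sum = _mm_fmadd_ps(_mm_set1_ps(a[1]), _mm_loadu_ps(&b[4]), sum);

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }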

Dependent loads reordering in CPU

被刻印的时光 ゝ Submitted on 2019-11-26 21:51:35
Question: I have been reading Memory Barriers: A Hardware View For Software Hackers, a very popular article by Paul E. McKenney. One of the things the paper highlights is that very weakly ordered processors like Alpha can reorder dependent loads, which seems to be a side effect of the partitioned cache. Snippet from the paper:

    struct el *insert(long key, long data)
    {
        struct el *p;
        p = kmalloc(sizeof(*p), GPF_ATOMIC);
        spin_lock(&mutex);
        p->next = head.next;
        p->key = key;
        p->data = data;
        …
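
Here is a self-contained C11 rendering of the same publish/lookup pattern (my own, not from the paper); the acquire/release atomics play the role of the paper's barriers, and on Alpha the acquire load is what emits the barrier that a plain dependent load would be missing:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct el {
        struct el *next;
        long key;
        long data;
    };

    static _Atomic(struct el *) head = NULL;

    static void publish(long key, long data) {
        struct el *p = malloc(sizeof(*p));
        p->key = key;
        p->data = data;
        p->next = atomic_load_explicit(&head, memory_order_relaxed);
        /* release store: writes to *p become visible before the pointer does */
        atomic_store_explicit(&head, p, memory_order_release);
    }

    static struct el *lookup(long key) {
        /* acquire load: orders the dependent loads through p */
        for (struct el *p = atomic_load_explicit(&head, memory_order_acquire);
             p != NULL; p = p->next)
            if (p->key == key)
                return p;
        return NULL;
    }

    int main(void) {
        publish(1, 100);
        printf("%ld\n", lookup(1)->data);
        return 0;
    }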