cpu-architecture | 易学教程

Virtually indexed physically tagged cache Synonym

Question: I am not able to fully grasp the concept of synonyms (aliasing) in VIPT caches. Consider the address split as: Here, suppose we have 2 pages with different VAs mapped to the same physical address (or frame number). The page-number parts of the VAs (bits 13-39), which are different, get translated to the PFN of the PA (bits 12-35), and the PFN is the same for both VAs because they are mapped to the same physical frame. Now the page-offset parts (bits 0-13) of both VAs are the same, as the data which they want to access
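
Whether synonyms can occur at all depends on how many cache-index bits lie above the page offset. The sketch below works that arithmetic out. The function name `synonym_bits` and the cache geometries are illustrative assumptions, not values taken from the question, and all sizes are assumed to be powers of two.

```python
def synonym_bits(page_size, cache_size, ways, line_size):
    """Count the cache-index bits that come from the virtual page number.

    All arguments are in bytes (except ways) and assumed to be powers of two.
    A result of 0 means the index lies entirely inside the page offset, so
    two virtual aliases of one physical line always select the same set.
    """
    sets = cache_size // (ways * line_size)
    # Bits used for line offset + set index.
    index_and_offset_bits = (sets * line_size).bit_length() - 1
    page_offset_bits = page_size.bit_length() - 1
    return max(0, index_and_offset_bits - page_offset_bits)


# Hypothetical geometries, chosen only for illustration.
print(synonym_bits(4096, 32 * 1024, 8, 64))   # index fits inside the page offset
print(synonym_bits(4096, 64 * 1024, 4, 64))   # some index bits are virtual
```

When the result is nonzero, two virtual addresses that share a physical frame can differ in those index bits and so land in different sets, which is the synonym problem the question asks about.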

What is the difference between a store queue and a store buffer?

Question: I am reading a number of papers, and they either use "store buffer" and "store queue" interchangeably or use the terms for different structures, and I just cannot follow along. This is what I thought a store queue was: It is an associatively searchable FIFO queue that keeps information about store instructions in fetch order. It keeps store addresses and data. It keeps store instructions' data until the instructions become non-speculative, i.e. they reach the retirement stage. Data of a store
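
The description above can be sketched as a toy model. This is only an illustration of the asker's understanding, not a model of any real microarchitecture. The class `StoreQueue` and its method names are hypothetical, and real hardware also tracks access sizes, partial overlaps, and speculation state.

```python
from collections import deque


class StoreQueue:
    """Toy store queue: FIFO in program order, searchable by address."""

    def __init__(self):
        self.entries = deque()  # oldest store on the left
        self.memory = {}        # stands in for the cache/memory hierarchy

    def execute_store(self, addr, data):
        # Stores enter in program order and wait here while speculative.
        self.entries.append((addr, data))

    def load(self, addr):
        # Associative search, youngest matching store first (forwarding).
        for entry_addr, data in reversed(self.entries):
            if entry_addr == addr:
                return data
        return self.memory.get(addr, 0)

    def retire_oldest(self):
        # Once non-speculative, the oldest store leaves in FIFO order.
        addr, data = self.entries.popleft()
        self.memory[addr] = data


sq = StoreQueue()
sq.execute_store(0x10, 1)
sq.execute_store(0x10, 2)
print(sq.load(0x10))  # forwarded from the youngest matching store
```

Keeping the entries in a FIFO while searching them by address is what lets a later load see an older store's data before that store has been written to memory.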

How do I find my CPU topology?

Question: I am using an Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz, as I found out from cat /proc/cpuinfo. But I want to know the exact hierarchy: how many sockets there are, how many cores there are per socket, and how many threads, if supported. Any idea?

Answer 1: You can use the lscpu command, which gives processor-related information. dmidecode -t processor is another option.

Answer 2: lstopo from the hwloc package reports the info you want: Socket L#0 + L3 L#0 (6144KB) L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 PU L#0 (P#0) PU
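
As a complement to the tools named in the answers, the same counts can be derived from the text of /proc/cpuinfo. The sketch below assumes the x86 Linux layout, with one blank-line-separated block per logical CPU and the fields `processor`, `physical id`, and `core id`. The function name `summarize_cpuinfo` and the `SAMPLE` text are made up for illustration and are not output from the asker's machine.

```python
def summarize_cpuinfo(text):
    """Return (sockets, cores, threads) parsed from /proc/cpuinfo text."""
    sockets = set()
    cores = set()
    threads = 0
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a logical-CPU block
        threads += 1
        socket = fields.get("physical id", "0")
        sockets.add(socket)
        # A core is identified by its (socket, core id) pair.
        cores.add((socket, fields.get("core id", fields["processor"])))
    return len(sockets), len(cores), threads


# Made-up sample: one socket, two cores, two hardware threads per core.
SAMPLE = "\n\n".join(
    f"processor\t: {cpu}\nphysical id\t: 0\ncore id\t\t: {core}"
    for cpu, core in [(0, 0), (1, 0), (2, 2), (3, 2)]
)
print(summarize_cpuinfo(SAMPLE))
```

On a Linux machine the real input would come from reading /proc/cpuinfo. On some architectures the `physical id` and `core id` fields may be absent, in which case this sketch falls back to treating every logical CPU as its own core on socket 0.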

Can branch prediction cause illegal instruction?

Question: In the following pseudo-code:

    if (rdtscp supported by hardware) {
        Invoke "rdtscp" instruction
    } else {
        Invoke "rdtsc" instruction
    }

Let's say the CPU does not support the rdtscp instruction, so we fall back to the else branch. If the CPU mispredicts the branch, is it possible for the instruction pipeline to try to execute rdtscp and raise an Illegal Instruction error?

Answer 1: It is explicitly documented for the #UD trap (Invalid Opcode Exception) in the Intel Processor Manuals, Volume 3A,

x86 and x64 share instruction set?

I don't know how a 32-bit application can run on a 64-bit OS. My understanding is that 32-bit/64-bit refers to register size, so the instruction sets should differ because the registers have different sizes. But I know there is the x86-64 instruction set, which is the 64-bit version of the x86 instruction set. Is x86-64 the reason we can run 32-bit applications on a 64-bit OS? If so, why are 32-bit applications sometimes not compatible with 64-bit Windows? Why do we need WOW64? (Sometimes we are asked to choose which version to install.) Does x64 instruction set have any other instruction set except x86

The ordering of L1 cache controller to process memory requests from CPU

Under the total store order (TSO) memory consistency model, an x86 CPU has a write buffer that buffers write requests and can serve reordered read requests from it. The write requests in the write buffer are said to exit and be issued toward the cache hierarchy in FIFO order, which is the same as program order. I am curious: to serve the write requests issued from the write buffer, does the L1 cache controller handle the write requests, finish their cache coherence, and insert the data into the L1 cache in the same order as the issue order? I think you're

How does CPU make data request via TLBs and caches?

Question: I am looking at the last few Intel microarchitectures (Nehalem/SB/IB and Haswell) and trying to work out what happens (at a fairly simplified level) when a data request is made. So far I have this rough idea:

1. The execution engine makes a data request.
2. "Memory control" queries the L1 DTLB.
3. If the above misses, the L2 TLB is queried.

At this point two things can happen, a miss or a hit: If it's a hit, the CPU tries the L1D/L2/L3 caches, the page table and then main memory/hard disk, in that order? If it's a

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

Are Intel x86_64 processors not only pipelined architecture, but also superscalar?

Pipelining: these two sequences execute in parallel (different stages of the same pipeline unit in the same clock, for example ADD with 4 stages):

    stage1 -> stage2 -> stage3 -> stage4 -> nothing
    nothing -> stage1 -> stage2 -> stage3 -> stage4

Superscalar: these two sequences execute in parallel (two instructions can be launched to different pipeline units in the same clock, for example ADD and MUL):

    ADD(stage1) -> ADD(stage2) -> ADD(stage3)
    MUL(stage1) -> MUL(stage2) -> MUL(stage3)

Yes, contemporary Intel
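
The two diagrams in the question can be reduced to one idealized cycle count. The sketch assumes no stalls, hazards, or resource conflicts, and that every instruction has the same number of stages. The function name `ideal_cycles` and its parameters are hypothetical.

```python
import math


def ideal_cycles(instructions, stages, issue_width=1):
    """Cycles to complete all instructions in an ideal, stall-free pipeline.

    issue_width is how many instructions may start in the same clock:
    1 models a plain (scalar) pipeline, more than 1 a superscalar one.
    """
    if instructions == 0:
        return 0
    # The last group starts after ceil(n / width) - 1 cycles, then needs
    # `stages` more cycles to drain.
    return stages + math.ceil(instructions / issue_width) - 1


print(ideal_cycles(2, 4))                  # pipelined: two 4-stage ADDs
print(ideal_cycles(2, 3, issue_width=2))   # superscalar: ADD and MUL together
```

With an issue width of 1, the second instruction starts one clock after the first, as in the pipelining diagram. With an issue width of 2, both start in the same clock, as in the superscalar diagram.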

Convert object file to another architecture

Question: I am trying to use a Wi-Fi dongle with a Raspberry Pi. The vendor of the dongle provides a Linux driver that I can compile successfully on the ARM architecture; however, one object file that comes with the driver was precompiled for an x86 architecture, which causes the linker to fail. I know it would be much easier to compile that (quite big) file again, but I don't have access to the source code. Is it possible to convert that object file from an x86 architecture to an ARM architecture?

cpu cacheline and prefetch policy

Question: I read this article: http://igoro.com/archive/gallery-of-processor-cache-effects/. The article says that, because of cache-line delays, the two loops in this code:

    int[] arr = new int[64 * 1024 * 1024];
    // Loop 1
    for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
    // Loop 2
    for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

will have almost the same execution time, and I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with 64-bit Ubuntu, an ARMv6-compatible processor rev 7 with Debian, and also run
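
The reasoning behind the claim can be checked with arithmetic rather than timing. Loop 2 does one sixteenth of the multiplications but pulls in the same number of cache lines as Loop 1. The sketch assumes 4-byte ints, 64-byte cache lines, and an array that starts on a line boundary. The function name `touched` is hypothetical.

```python
def touched(n, stride, elem_size=4, line_size=64):
    """Return (elements accessed, distinct cache lines accessed).

    Models `for i in range(0, n, stride)` over an array of n elements
    that is assumed to start on a cache-line boundary.
    """
    per_line = line_size // elem_size
    elements = (n + stride - 1) // stride
    last_index = (elements - 1) * stride
    if stride >= per_line:
        # Consecutive accesses are at least one full line apart.
        lines = elements
    else:
        # No line between the first and last access is skipped.
        lines = last_index // per_line + 1
    return elements, lines


N = 64 * 1024 * 1024
print(touched(N, 1))    # Loop 1
print(touched(N, 16))   # Loop 2
```

If memory traffic dominates the run time, equal line counts would explain similar timings. On machines with different line sizes, or where the arithmetic itself is the bottleneck, the two loops can diverge.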