cpu-architecture | 易学教程

Why misaligned address access incur 2 or more accesses?

The usual answers to why data should be aligned are that aligned access is more efficient and that it simplifies the design of the CPU. A relevant question and its answers are here, and another source is here, but neither resolves my question. Suppose a CPU has an access granularity of 4 bytes, meaning that it reads 4 bytes at a time. Both sources say that if I access misaligned data, say at address 0x1, the CPU has to do 2 accesses (one covering addresses 0x0, 0x1, 0x2 and 0x3, one covering addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read data
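The two-access claim in the question can be checked with a little arithmetic. The sketch below is only a model of the scenario described, not of any particular CPU: it assumes the memory interface can fetch only blocks whose base address is a multiple of the access granularity, and the function name `aligned_blocks_touched` is made up for illustration.

```python
def aligned_blocks_touched(addr, size, granularity=4):
    """Return the base addresses of the aligned blocks that an access overlaps.

    Assumption: memory is only reachable in blocks of `granularity` bytes
    whose base address is a multiple of `granularity`.
    """
    first_block = addr - (addr % granularity)
    last_byte = addr + size - 1
    last_block = last_byte - (last_byte % granularity)
    return list(range(first_block, last_block + 1, granularity))


# An aligned 4-byte read stays inside one block.
print([hex(b) for b in aligned_blocks_touched(0x0, 4)])
# A 4-byte read at 0x1 covers bytes 0x1..0x4, which straddles two blocks.
print([hex(b) for b in aligned_blocks_touched(0x1, 4)])
```

Under that assumption the read at 0x1 needs the blocks at 0x0 and 0x4, which is the pair of accesses the question refers to.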

What does the R stand for in RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP? [duplicate]

This question already has an answer here: What do the E and R prefixes stand for in the names of Intel 32-bit and 64-bit registers? (1 answer) The x86 assembly language has had to change as the x86 processor architecture has moved from 8-bit to 16-bit to 32-bit and now 64-bit. I know that in the 32-bit register names (EAX, EBX, etc.), the E prefix stands for Extended, meaning the 32-bit form of the register rather than the 16-bit form (AX, BX, etc.). What does the R prefix stand for in the 64-bit register names? I think it's just R for "register", since there are
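As a small illustration of how the differently prefixed names relate, the following sketch models the narrower names as views of the low bits of the 64-bit register. It only does bit masking on a Python integer; the function name `sub_registers` is hypothetical, and the sketch says nothing about what the letter R stands for.

```python
def sub_registers(rax):
    """Show the narrower register names as views of the low bits of RAX."""
    rax &= 0xFFFFFFFFFFFFFFFF          # keep 64 bits
    return {
        "RAX": rax,                    # full 64-bit register
        "EAX": rax & 0xFFFFFFFF,       # low 32 bits
        "AX": rax & 0xFFFF,            # low 16 bits
        "AH": (rax >> 8) & 0xFF,       # bits 8..15
        "AL": rax & 0xFF,              # low 8 bits
    }


for name, value in sub_registers(0x1122334455667788).items():
    print(f"{name:>3} = {value:#x}")
```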

VIPT Cache: Connection between TLB & Cache?

Question: I just want to clarify the concept, and I hope to find sufficiently detailed answers that throw some light on how everything actually works in the hardware. Please provide any relevant details. In the case of VIPT caches, the memory request is sent in parallel to both the TLB and the cache. From the TLB we get the translated physical address. From the cache indexing we get a list of tags (e.g. from all the cache lines belonging to a set). Then the translated TLB address is matched with the list of
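The lookup sequence described in the question can be sketched in software. This is a toy model under stated assumptions, not a description of any real cache: 64-byte lines, 64 sets and 4 KiB pages are assumed sizes, the TLB is a plain dict that always hits, and `vipt_lookup` is a made-up name. In hardware the set read and the TLB lookup happen at the same time; here they simply run one after the other.

```python
LINE_BITS = 6    # assumed 64-byte cache lines
INDEX_BITS = 6   # assumed 64 sets
PAGE_BITS = 12   # assumed 4 KiB pages


def vipt_lookup(vaddr, tlb, cache_sets):
    """Return the way that hits, or None on a cache miss.

    tlb: dict mapping virtual page number -> physical frame number
         (assumed to always contain the page, i.e. no TLB miss handling).
    cache_sets: list of sets, each a list of physical tags (one per way).
    """
    # Index the cache with virtual address bits and read every tag in the set.
    set_index = (vaddr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)
    candidate_tags = cache_sets[set_index]

    # Translate the virtual page number through the TLB.
    page_offset = vaddr & ((1 << PAGE_BITS) - 1)
    paddr = (tlb[vaddr >> PAGE_BITS] << PAGE_BITS) | page_offset

    # Compare the physical tag against each candidate tag from the set.
    phys_tag = paddr >> (LINE_BITS + INDEX_BITS)
    for way, tag in enumerate(candidate_tags):
        if tag == phys_tag:
            return way
    return None


tlb = {0x12345: 0xABC}
cache_sets = [[] for _ in range(1 << INDEX_BITS)]
cache_sets[41] = [0x111, 0xABC, 0x222, 0x333]
print(vipt_lookup(0x12345A40, tlb, cache_sets))
print(vipt_lookup(0x12345000, tlb, cache_sets))
```

With these sizes the index and line-offset bits together span exactly the 12-bit page offset, so the index is the same before and after translation.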

Peak FLOPs per cycle for ARM11 and Cortex-A7 cores in Raspberry Pi 1 and 2

Question: I would like to know the peak FLOPs per cycle for the ARM1176JZF-S core in the Raspberry Pi 1 and the Cortex-A7 cores in the Raspberry Pi 2. From the ARM1176JZF-S Technical Reference Manual it seems that VFPv2 can do one SP MAC every clock cycle and one DP MAC every other clock cycle. In addition there are three pipelines which can operate in parallel: a MAC pipeline (FMAC), a division and sqrt pipeline (DS), and a load/store pipeline (LS). Based on this then it appears the ARM1176JZF-S of
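The MAC throughput figures quoted from the manual translate into FLOPs per cycle with simple arithmetic. The sketch below assumes the common convention that one multiply-accumulate counts as two floating-point operations, and it considers the MAC pipeline only; the function name is illustrative.

```python
from fractions import Fraction

FLOPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC


def peak_flops_per_cycle(macs_per_cycle):
    """Convert a MAC issue rate into peak FLOPs per cycle."""
    return FLOPS_PER_MAC * macs_per_cycle


sp = peak_flops_per_cycle(Fraction(1, 1))  # one SP MAC every cycle
dp = peak_flops_per_cycle(Fraction(1, 2))  # one DP MAC every other cycle
print("single precision:", sp, "FLOPs/cycle")
print("double precision:", dp, "FLOPs/cycle")
```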

How do Intel CPUs that use the ring bus topology decode and handle port I/O operations

Question: I understand port I/O at the hardware abstraction level (i.e. the CPU asserts a pin that indicates to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address bus model), but I'm not really sure how it's implemented microarchitecturally on modern CPUs, and in particular how a port I/O operation appears on the ring bus. Firstly, where does the IN/OUT instruction get allocated: the reservation station or the load/store buffer? My initial

What is the “EU” in x86 architecture? (calculates effective address?)

Question: I read somewhere that effective addresses (as in the LEA instruction) in x86 instructions are calculated by the "EU." What is the EU? What is involved exactly in calculating an effective address? I've only learned about the MC68k instruction set (UC Boulder teaches this first) and I can't find a good x86 webpage by searching the web. Answer 1: "EU" is the generic term for Execution Unit. The ALU is one example of an execution unit. FADD and FMUL, i.e. the floating point adder or multiplier, are
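For the "what is involved" part of the question, the arithmetic behind an x86 memory operand of the form base + index*scale + displacement can be sketched as follows. This is a simplified model that assumes 64-bit addressing and ignores segment bases and address-size overrides; `effective_address` is a hypothetical helper, not an API.

```python
MASK64 = (1 << 64) - 1


def effective_address(base=0, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


# Roughly what `lea rax, [rbx + rcx*4 + 8]` computes
# when rbx = 0x1000 and rcx = 3.
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))
```

LEA writes this computed value to a register without accessing memory, which is why it is often used for plain arithmetic.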

The integer division algorithm of Intel's x86 processors

Question: Which integer division algorithm does Intel implement in their x86 processors? Answer 1: Intel has a paper, Improvements in the Intel® Core™2 Processor Family Architecture and Microarchitecture, in which they discuss a number of different division algorithms. The first paragraph: The new Radix-16 floating-point divider with variable latency Radix-16 integer divide capability replaces the Merom Radix-4 floating point divide and Radix-2 square root and integer divide hardware. The preceding
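To make the radix terminology concrete: a radix-2^k digit-recurrence divider produces k quotient bits per iteration, so a higher radix needs fewer iterations for the same operand width. The sketch below is a plain restoring-style model written for illustration only. It does not reproduce Intel's hardware, which selects each quotient digit with dedicated logic instead of the subtraction loop used here, and the function name and the 32-bit default width are assumptions.

```python
def digit_recurrence_divide(dividend, divisor, radix_bits, width=32):
    """Unsigned division that retires `radix_bits` quotient bits per step.

    Returns (quotient, remainder, steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    steps = -(-width // radix_bits)           # ceil(width / radix_bits)
    total_bits = steps * radix_bits
    if not 0 <= dividend < (1 << total_bits):
        raise ValueError("dividend does not fit in the modelled width")
    digit_mask = (1 << radix_bits) - 1
    quotient = 0
    remainder = 0
    for step in range(steps):
        # Bring down the next group of dividend bits, most significant first.
        shift = total_bits - (step + 1) * radix_bits
        remainder = (remainder << radix_bits) | ((dividend >> shift) & digit_mask)
        # Find the quotient digit (0 .. 2**radix_bits - 1) by repeated subtraction.
        q_digit = 0
        while remainder >= divisor:
            remainder -= divisor
            q_digit += 1
        quotient = (quotient << radix_bits) | q_digit
    return quotient, remainder, steps


for bits, name in ((1, "radix-2"), (2, "radix-4"), (4, "radix-16")):
    q, r, steps = digit_recurrence_divide(1000, 7, bits)
    print(f"{name}: 1000 / 7 = {q} remainder {r} in {steps} steps")
```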

Why isn't RDTSC a serializing instruction?

The Intel manuals for the RDTSC instruction warn that out of order execution can change when RDTSC is actually executed, so they recommend inserting a CPUID instruction in front of it because CPUID will serialize the instruction stream (CPUID is never executed out of order). My question is simple: if they had the ability to make instructions serializing, why didn't they make RDTSC serializing? The entire point of it appears to be to get cycle accurate timings. Is there a situation under which you would not want to precede it with a serializing instruction? Newer Intel CPUs have a separate

How has CPU architecture evolution affected virtual function call performance?

Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz. It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor, viz. instruction reordering, cache preloading, dependency interleaving, etc. The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function through a pointer (!) and this was apparently normal. This is not a negligible "don't worry about it

Where is the Write-Combining Buffer located? x86

Question: How is the Write-Combine buffer physically hooked up? I have seen block diagrams illustrating a number of variants: between L1 and the memory controller; between the CPU's store buffer and the memory controller; between the CPU's AGUs and/or store units. Is it microarchitecture-dependent? Answer 1: Write buffers can have different purposes or different uses in different processors. This answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different