cpu-architecture

Why is the program counter incremented by 1 if memory is organised as words, and by 2 if it is organised as bytes?

こ雲淡風輕ζ submitted on 2020-01-24 10:34:06
Question: If in a computer an instruction is 16 bits wide and memory is organized as 16-bit words, then the address of the next instruction is obtained by adding one to the address of the current instruction. If, instead, memory is organized as bytes, which can be addressed individually, then we need to add two to the current instruction address to get the address of the next instruction to be executed in sequence. Why is that? Please explain this concept. I am new to computer organisation and …
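A minimal sketch of the arithmetic, assuming the 16-bit instruction size and the two memory organizations described above (names and values are illustrative, not from the question):

```c
#include <stdio.h>

/* The PC always advances by one instruction; the numeric step is the
 * instruction size divided by the size of the addressable unit. */
int main(void) {
    const unsigned instr_bits = 16;
    const unsigned word_bits  = 16;   /* word-addressable memory */
    const unsigned byte_bits  = 8;    /* byte-addressable memory */

    printf("word-addressed memory: PC += %u\n", instr_bits / word_bits); /* 1 */
    printf("byte-addressed memory: PC += %u\n", instr_bits / byte_bits); /* 2 */
    return 0;
}
```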

Is there any difference between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?

假装没事ソ submitted on 2020-01-24 09:29:28
Question: As far as I know, the main difference in runtime ordering between the rdtsc and rdtscp instructions is whether execution waits until all previous instructions have executed locally. In other words, lfence + rdtsc = rdtscp, because an lfence preceding the rdtsc instruction makes the following rdtsc execute only after all previous instructions finish locally. However, I've seen some example code that uses rdtsc at the start of measurement and rdtscp at the end.
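For reference, a minimal C sketch of the pattern the question describes, using the GCC/Clang intrinsics __rdtsc, __rdtscp and _mm_lfence; work() is a placeholder workload, and the exact fencing needed depends on what you want kept inside the timed window:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

static void work(void) {                      /* placeholder workload */
    volatile uint64_t sink = 0;
    for (int i = 0; i < 1000000; i++) sink += i;
}

int main(void) {
    unsigned aux;

    _mm_lfence();                 /* don't start timing before earlier work drains */
    uint64_t t0 = __rdtsc();      /* start timestamp */
    work();
    uint64_t t1 = __rdtscp(&aux); /* waits for preceding instructions to finish */
    _mm_lfence();                 /* keep later instructions out of the window */

    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}
```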

How does the indexing of Ice Lake's 48 KiB L1 data cache work?

孤街浪徒 submitted on 2020-01-24 04:27:05
Question: The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way associative L1 data cache for the Ice Lake microarchitecture, with the footnote "Software-visible latency/bandwidth will vary depending on access patterns and other factors". This baffled me because: there are 96 sets (48 KiB / 64 / 8), which is not a power of two; the set-index bits and the byte-offset bits add up to more than 12 bits, which makes the cheap PIPT-as-VIPT trick unavailable for 4 KiB pages. All in …
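The arithmetic behind those two points, as a quick sketch (numbers taken from the question itself):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const unsigned cache_bytes = 48 * 1024;
    const unsigned line_bytes  = 64;
    const unsigned ways        = 8;

    unsigned sets        = cache_bytes / line_bytes / ways;        /* 96 */
    unsigned index_bits  = (unsigned)ceil(log2((double)sets));     /* 7  */
    unsigned offset_bits = (unsigned)log2((double)line_bytes);     /* 6  */

    printf("sets = %u (not a power of two)\n", sets);
    printf("index + offset bits = %u (> 12-bit 4 KiB page offset)\n",
           index_bits + offset_bits);
    return 0;
}
```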

Game Boy: Half-carry flag and 16-bit instructions (especially opcode 0xE8)

。_饼干妹妹 submitted on 2020-01-23 05:14:11
Question: Like so many others, I am writing a Game Boy emulator, and I have a couple of questions regarding the instruction 0xE8 (ADD SP, n with an 8-bit immediate). It is claimed here that for 16-bit instructions the half-carry flag is set if a carry occurs from bit 7 to bit 8, whereas here it is said that the half-carry flag indicates a carry from bit 11 to bit 12. In this Reddit thread there seems to be a bit of confusion regarding the issue, and the (notoriously flawed, I hear) Game Boy CPU manual …
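A hedged sketch of one common reading among emulator authors (not settled by the sources quoted above): ADD SP, e8 sets its flags from the low-byte addition, i.e. H on carry out of bit 3 and C on carry out of bit 7, with Z and N cleared. The Cpu struct and field names below are made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint16_t sp; uint8_t f; } Cpu;   /* hypothetical CPU state */
enum { FLAG_C = 1 << 4, FLAG_H = 1 << 5 };        /* GB F register: C=bit4, H=bit5 */

static void add_sp_e8(Cpu *cpu, int8_t e8) {
    uint16_t sp = cpu->sp;
    uint8_t  u  = (uint8_t)e8;
    uint8_t  flags = 0;

    if (((sp & 0x0F) + (u & 0x0F)) > 0x0F) flags |= FLAG_H; /* carry out of bit 3 */
    if (((sp & 0xFF) + u) > 0xFF)          flags |= FLAG_C; /* carry out of bit 7 */

    cpu->sp = (uint16_t)(sp + (int16_t)e8);  /* result uses the signed operand */
    cpu->f  = flags;                         /* Z and N cleared */
}

int main(void) {
    Cpu cpu = { .sp = 0xFFF8, .f = 0 };
    add_sp_e8(&cpu, 0x08);                   /* SP = 0xFFF8, n = +8 */
    printf("SP=%04X F=%02X\n", cpu.sp, cpu.f);
    return 0;
}
```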

Why is acquire semantics only for reads, not writes? How can an LL/SC acquire CAS take a lock without the store reordering with the critical section?

只愿长相守 submitted on 2020-01-22 16:36:06
Question: To start with, consider release semantics. Suppose a data set is protected with a spinlock (mutex, etc.; it doesn't matter which exact implementation is used; for now, assume 0 means it's free and 1 means busy). After changing the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before the 0 is stored to the spinlock address, the store is executed with release semantics, meaning all previous reads and writes must be made visible to other threads before this store. It …
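A minimal C11 sketch of that pattern, with acquire ordering on the lock side and release ordering on the unlock side (0 = free, 1 = busy, as in the question):

```c
#include <stdatomic.h>

static atomic_int lock = 0;

static void spin_lock(void) {
    int expected = 0;
    /* acquire: reads/writes inside the critical section cannot be
     * reordered before this successful exchange */
    while (!atomic_compare_exchange_weak_explicit(
               &lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;            /* a failed CAS overwrites 'expected' */
    }
}

static void spin_unlock(void) {
    /* release: all earlier reads/writes become visible before the 0 */
    atomic_store_explicit(&lock, 0, memory_order_release);
}

int main(void) {
    spin_lock();
    /* ... critical section ... */
    spin_unlock();
    return 0;
}
```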

Does the Instruction Queue in Intel CPUs provide static branch prediction?

99封情书 submitted on 2020-01-22 05:50:08
Question: Volume 3 of the Intel manuals contains the description of a hardware event counter: BACLEAR_FORCE_IQ: Counts the number of times a BACLEAR was forced by the Instruction Queue. The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit. If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken, then the IQ will force the Branch Address …
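The excerpt does not say what the static scheme actually is. Purely as an illustration, the classic textbook static rule is "backward branches predicted taken, forward branches predicted not-taken"; a sketch of that rule (not confirmed to be the IQ's exact scheme) looks like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Classic static rule: backward branches (e.g. loop back-edges) are
 * predicted taken, forward branches predicted not-taken. */
static bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}

int main(void) {
    printf("loop back-edge: %d\n", static_predict_taken(0x1040, 0x1000)); /* 1 */
    printf("forward branch: %d\n", static_predict_taken(0x1000, 0x1080)); /* 0 */
    return 0;
}
```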

Why are denormalized floats so much slower than other floats, from a hardware architecture viewpoint?

时间秒杀一切 submitted on 2020-01-21 02:42:46
Question: Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems. I'm curious, from a CPU architecture viewpoint: why do denormals have to be that much slower? Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost under the (mistaken) assumption that denormals don't matter? In the former case, if denormals are intrinsically hardware-unfriendly, are …
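A rough sketch of how the slowdown is usually demonstrated and mitigated on x86, assuming scalar SSE math and the MXCSR flush-to-zero / denormals-are-zero controls; the 100x figure quoted above is workload- and CPU-dependent:

```c
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <stdio.h>
#include <time.h>

/* x = x*0.5 + c converges to 2*c, so the loop stays on normal operands
 * when c is normal and on subnormal operands when c is subnormal. */
static double time_loop(float c) {
    volatile float x = 1.0f;
    clock_t t0 = clock();
    for (long i = 0; i < 50 * 1000 * 1000L; i++)
        x = x * 0.5f + c;
    clock_t t1 = clock();
    (void)x;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("normal operands:    %.3f s\n", time_loop(0.1f));
    printf("subnormal operands: %.3f s\n", time_loop(1e-39f));

    /* Common mitigation: flush subnormal results/inputs to zero via MXCSR. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    printf("subnormal + FTZ/DAZ: %.3f s\n", time_loop(1e-39f));
    return 0;
}
```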

Inclusive or exclusive? L1, L2 cache in the Intel Core Ivy Bridge processor

帅比萌擦擦* submitted on 2020-01-21 02:13:06
Question: I have an Intel Core Ivy Bridge processor, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (L1 32 KB, L2 256 KB, L3 8 MB). I know L3 is inclusive and shared among multiple cores. I want to know the following with respect to my system. PART 1: Is L1 inclusive or exclusive? Is L2 inclusive or exclusive? PART 2: If L1 and L2 are both inclusive, then to find the access time of L2 we first declare an array (1 MB) larger than the L2 cache (256 KB), then start accessing the whole array to load it into the L2 cache.
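For PART 2, a hedged sketch of the usual pointer-chasing approach. Note it uses a working set between the L1 and L2 sizes (rather than the 1 MB array from the question) so that most loads are expected to hit L2, and a sequential chain like this can still be helped by hardware prefetching (a randomized chain avoids that, omitted here for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (128 * 1024)         /* 128 KB: > L1 (32 KB), < L2 (256 KB) */
#define STRIDE      64                   /* one cache line per step */
#define N           (WORKING_SET / STRIDE)
#define ITERS       (50 * 1000 * 1000L)

int main(void) {
    char *buf = malloc(WORKING_SET);
    if (!buf) return 1;

    /* Build a circular pointer chain so every load depends on the previous
     * one: this measures load latency rather than bandwidth. */
    for (long i = 0; i < N; i++)
        *(void **)(buf + i * STRIDE) = buf + ((i + 1) % N) * STRIDE;

    void **p = (void **)buf;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        p = *p;                          /* dependent load chain */
    clock_t t1 = clock();

    printf("%.2f ns per load (final p=%p)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / ITERS, (void *)p);
    free(buf);
    return 0;
}
```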

What is the difference between sjlj, dwarf and seh?

独自空忆成欢 submitted on 2020-01-18 08:12:08
Question: I can't find enough information to decide which compiler I should use to compile my project. There are several programs on different computers simulating a process. On Linux I'm using GCC. Everything is great: I can optimize the code, it compiles fast and doesn't use much memory. I did my own benchmark with the MSVC and GCC compilers. The latter produces slightly faster binaries (for each subarchitecture), though its compile time is much longer than MSVC's. So I decided to use MinGW, but I can't find any …