cpu-architecture

Why is the program counter incremented by 1 if memory is organised as words, and by 2 if it is organised as bytes?

こ雲淡風輕ζ submitted on 2020-01-24 10:34:06
Question: If in a computer an instruction is 16 bits wide and memory is organized as 16-bit words, then the address of the next instruction is obtained by adding one to the address of the current instruction. If, instead, memory is organized as bytes, which can be addressed individually, then we need to add two to the current instruction address to get the address of the next instruction to be executed in sequence. Why is that? Please explain this concept. I am new to computer organisation and …
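A minimal sketch of the arithmetic, assuming the 16-bit instruction size and the two memory organizations described above (names and values are illustrative, not from the question):

```c
#include <stdio.h>

/* The PC always advances by one instruction; the numeric step is the
 * instruction size divided by the size of the addressable unit. */
int main(void) {
    const unsigned instr_bits = 16;
    const unsigned word_bits  = 16;   /* word-addressable memory */
    const unsigned byte_bits  = 8;    /* byte-addressable memory */

    printf("word-addressed memory: PC += %u\n", instr_bits / word_bits); /* 1 */
    printf("byte-addressed memory: PC += %u\n", instr_bits / byte_bits); /* 2 */
    return 0;
}
```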

Is there any difference between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time?

假装没事ソ submitted on 2020-01-24 09:29:28
Question: As far as I know, the main difference in runtime ordering between the rdtsc and rdtscp instructions is whether execution waits until all previous instructions have executed locally. In other words, lfence + rdtsc = rdtscp, because an lfence preceding the rdtsc instruction makes the following rdtsc execute only after all previous instructions finish locally. However, I've seen some example code that uses rdtsc at the start of measurement and rdtscp at the end.
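For reference, a minimal C sketch of the pattern the question describes, using the GCC/Clang intrinsics __rdtsc, __rdtscp and _mm_lfence; work() is a placeholder workload, and the exact fencing needed depends on what you want kept inside the timed window:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

static void work(void) {                      /* placeholder workload */
    volatile uint64_t sink = 0;
    for (int i = 0; i < 1000000; i++) sink += i;
}

int main(void) {
    unsigned aux;

    _mm_lfence();                 /* don't start timing before earlier work drains */
    uint64_t t0 = __rdtsc();      /* start timestamp */
    work();
    uint64_t t1 = __rdtscp(&aux); /* waits for preceding instructions to finish */
    _mm_lfence();                 /* keep later instructions out of the window */

    printf("elapsed TSC ticks: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}
```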

How does the indexing of Ice Lake's 48 KiB L1 data cache work?

孤街浪徒 submitted on 2020-01-24 04:27:05
Question: The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way associative L1 data cache for the Ice Lake microarchitecture, with the footnote "Software-visible latency/bandwidth will vary depending on access patterns and other factors". This baffled me because: there are 96 sets (48 KiB / 64 / 8), which is not a power of two; the set-index bits and the byte-offset bits add up to more than 12 bits, which makes the cheap PIPT-as-VIPT trick unavailable for 4 KiB pages. All in …
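The arithmetic behind those two points, as a quick sketch (numbers taken from the question itself):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const unsigned cache_bytes = 48 * 1024;
    const unsigned line_bytes  = 64;
    const unsigned ways        = 8;

    unsigned sets        = cache_bytes / line_bytes / ways;        /* 96 */
    unsigned index_bits  = (unsigned)ceil(log2((double)sets));     /* 7  */
    unsigned offset_bits = (unsigned)log2((double)line_bytes);     /* 6  */

    printf("sets = %u (not a power of two)\n", sets);
    printf("index + offset bits = %u (> 12-bit 4 KiB page offset)\n",
           index_bits + offset_bits);
    return 0;
}
```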

Game Boy: Half-carry flag and 16-bit instructions (especially opcode 0xE8)

。_饼干妹妹 submitted on 2020-01-23 05:14:11
Question: Like so many others, I am writing a Game Boy emulator, and I have a couple of questions regarding the instruction 0xE8 (ADD SP, n with an 8-bit immediate). It is claimed here that for 16-bit instructions the half-carry flag is set if a carry occurs from bit 7 to bit 8, whereas here it is said that the half-carry flag indicates a carry from bit 11 to bit 12. In this Reddit thread there seems to be a bit of confusion regarding the issue, and the (notoriously flawed, I hear) Game Boy CPU manual …
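A hedged sketch of one common reading among emulator authors (not settled by the sources quoted above): ADD SP, e8 sets its flags from the low-byte addition, i.e. H on carry out of bit 3 and C on carry out of bit 7, with Z and N cleared. The Cpu struct and field names below are made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint16_t sp; uint8_t f; } Cpu;   /* hypothetical CPU state */
enum { FLAG_C = 1 << 4, FLAG_H = 1 << 5 };        /* GB F register: C=bit4, H=bit5 */

static void add_sp_e8(Cpu *cpu, int8_t e8) {
    uint16_t sp = cpu->sp;
    uint8_t  u  = (uint8_t)e8;
    uint8_t  flags = 0;

    if (((sp & 0x0F) + (u & 0x0F)) > 0x0F) flags |= FLAG_H; /* carry out of bit 3 */
    if (((sp & 0xFF) + u) > 0xFF)          flags |= FLAG_C; /* carry out of bit 7 */

    cpu->sp = (uint16_t)(sp + (int16_t)e8);  /* result uses the signed operand */
    cpu->f  = flags;                         /* Z and N cleared */
}

int main(void) {
    Cpu cpu = { .sp = 0xFFF8, .f = 0 };
    add_sp_e8(&cpu, 0x08);                   /* SP = 0xFFF8, n = +8 */
    printf("SP=%04X F=%02X\n", cpu.sp, cpu.f);
    return 0;
}
```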

Why is acquire semantics only for reads, not writes? How can an LL/SC acquire CAS take a lock without the store reordering with the critical section?

只愿长相守 submitted on 2020-01-22 16:36:06
Question: To start with, consider release semantics. Suppose a data set is protected with a spinlock (mutex, etc.; it doesn't matter which exact implementation is used; for now, assume 0 means it's free and 1 means busy). After changing the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before the 0 is stored to the spinlock address, the store is executed with release semantics, meaning all previous reads and writes must be made visible to other threads before this store. It …
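A minimal C11 sketch of that pattern, with acquire ordering on the lock side and release ordering on the unlock side (0 = free, 1 = busy, as in the question):

```c
#include <stdatomic.h>

static atomic_int lock = 0;

static void spin_lock(void) {
    int expected = 0;
    /* acquire: reads/writes inside the critical section cannot be
     * reordered before this successful exchange */
    while (!atomic_compare_exchange_weak_explicit(
               &lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;            /* a failed CAS overwrites 'expected' */
    }
}

static void spin_unlock(void) {
    /* release: all earlier reads/writes become visible before the 0 */
    atomic_store_explicit(&lock, 0, memory_order_release);
}

int main(void) {
    spin_lock();
    /* ... critical section ... */
    spin_unlock();
    return 0;
}
```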

Does the Instruction Queue in Intel CPUs provide static branch prediction?

99封情书 submitted on 2020-01-22 05:50:08
Question: Volume 3 of the Intel manuals contains the description of a hardware event counter: BACLEAR_FORCE_IQ: Counts the number of times a BACLEAR was forced by the Instruction Queue. The IQ is also responsible for providing conditional branch prediction direction based on a static scheme and dynamic data provided by the L2 Branch Prediction Unit. If the conditional branch target is not found in the Target Array and the IQ predicts that the branch is taken, then the IQ will force the Branch Address …
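The excerpt does not say what the static scheme actually is. Purely as an illustration, the classic textbook static rule is "backward branches predicted taken, forward branches predicted not-taken"; a sketch of that rule (not confirmed to be the IQ's exact scheme) looks like this:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Classic static rule: backward branches (e.g. loop back-edges) are
 * predicted taken, forward branches predicted not-taken. */
static bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}

int main(void) {
    printf("loop back-edge: %d\n", static_predict_taken(0x1040, 0x1000)); /* 1 */
    printf("forward branch: %d\n", static_predict_taken(0x1000, 0x1080)); /* 0 */
    return 0;
}
```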

Why are denormalized floats so much slower than other floats, from a hardware architecture viewpoint?

时间秒杀一切 submitted on 2020-01-21 02:42:46
Question: Denormals are known to underperform severely, 100x or so, compared to normals. This frequently causes unexpected software problems. I'm curious, from a CPU architecture viewpoint: why do denormals have to be that much slower? Is the lack of performance intrinsic to their unfortunate representation? Or do CPU architects neglect them to reduce hardware cost under the (mistaken) assumption that denormals don't matter? In the former case, if denormals are intrinsically hardware-unfriendly, are …
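A rough sketch of how the slowdown is usually demonstrated and mitigated on x86, assuming scalar SSE math and the MXCSR flush-to-zero / denormals-are-zero controls; the 100x figure quoted above is workload- and CPU-dependent:

```c
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <stdio.h>
#include <time.h>

/* x = x*0.5 + c converges to 2*c, so the loop stays on normal operands
 * when c is normal and on subnormal operands when c is subnormal. */
static double time_loop(float c) {
    volatile float x = 1.0f;
    clock_t t0 = clock();
    for (long i = 0; i < 50 * 1000 * 1000L; i++)
        x = x * 0.5f + c;
    clock_t t1 = clock();
    (void)x;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("normal operands:    %.3f s\n", time_loop(0.1f));
    printf("subnormal operands: %.3f s\n", time_loop(1e-39f));

    /* Common mitigation: flush subnormal results/inputs to zero via MXCSR. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    printf("subnormal + FTZ/DAZ: %.3f s\n", time_loop(1e-39f));
    return 0;
}
```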

Inclusive or exclusive? L1, L2 cache in the Intel Core Ivy Bridge processor

帅比萌擦擦* submitted on 2020-01-21 02:13:06
Question: I have an Intel Core Ivy Bridge processor, Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (L1 32 KB, L2 256 KB, L3 8 MB). I know L3 is inclusive and shared among multiple cores. I want to know the following with respect to my system. PART 1: Is L1 inclusive or exclusive? Is L2 inclusive or exclusive? PART 2: If L1 and L2 are both inclusive, then to find the access time of L2 we first declare an array (1 MB) larger than the L2 cache (256 KB), then start accessing the whole array to load it into the L2 cache.
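For PART 2, a hedged sketch of the usual pointer-chasing approach. Note it uses a working set between the L1 and L2 sizes (rather than the 1 MB array from the question) so that most loads are expected to hit L2, and a sequential chain like this can still be helped by hardware prefetching (a randomized chain avoids that, omitted here for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (128 * 1024)         /* 128 KB: > L1 (32 KB), < L2 (256 KB) */
#define STRIDE      64                   /* one cache line per step */
#define N           (WORKING_SET / STRIDE)
#define ITERS       (50 * 1000 * 1000L)

int main(void) {
    char *buf = malloc(WORKING_SET);
    if (!buf) return 1;

    /* Build a circular pointer chain so every load depends on the previous
     * one: this measures load latency rather than bandwidth. */
    for (long i = 0; i < N; i++)
        *(void **)(buf + i * STRIDE) = buf + ((i + 1) % N) * STRIDE;

    void **p = (void **)buf;
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        p = *p;                          /* dependent load chain */
    clock_t t1 = clock();

    printf("%.2f ns per load (final p=%p)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / ITERS, (void *)p);
    free(buf);
    return 0;
}
```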

What is the difference between sjlj, dwarf and seh?

独自空忆成欢 submitted on 2020-01-18 08:12:08
Question: I can't find enough information to decide which compiler I should use to compile my project. There are several programs on different computers simulating a process. On Linux I'm using GCC. Everything is great: I can optimize the code, it compiles fast and doesn't use much memory. I did my own benchmark with the MSVC and GCC compilers. The latter produces slightly faster binaries (for each subarchitecture), though its compile time is much longer than MSVC's. So I decided to use MinGW, but I can't find any …