cpu-architecture

Double-precision operations: 32-bit vs 64-bit machines

倖福魔咒の submitted on 2019-12-06 08:09:06
Why don't we see twice the performance when executing 64-bit operations (e.g. double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine? On a 32-bit machine, don't we need to fetch twice as much from memory? More importantly, don't we need twice as many cycles to execute a 64-bit operation? "64-bit machine" is an ambiguous term but usually means that the processor's general-purpose registers are 64 bits wide. Compare the 8086 and 8088, which have the same instruction set and can both be called 16-bit processors in this sense. When the phrase is used in this
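The naive fetch-count argument in the question can be made concrete. The sketch below models an idealized machine whose memory path is one machine word wide; it is an illustration of the question's assumption, not of real hardware, since 32-bit x86 CPUs already have 64-bit-wide FPU/SSE2 datapaths and can load a double in one access.

```python
import struct

DOUBLE_SIZE = struct.calcsize("d")  # a C double is 8 bytes regardless of machine width

def word_transfers(operand_bytes, word_bytes):
    """Word-sized transfers needed for one operand in a naive model
    where the memory path is exactly one machine word wide."""
    return -(-operand_bytes // word_bytes)  # ceiling division

transfers_32 = word_transfers(DOUBLE_SIZE, 4)  # idealized 32-bit machine: 2 transfers
transfers_64 = word_transfers(DOUBLE_SIZE, 8)  # 64-bit machine: 1 transfer
```

On real 32-bit x86 the x87/SSE2 units perform a single 64-bit load, which is why the factor of two from this naive model never materializes.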

Assembly PC Relative Addressing Mode

China☆狼群 submitted on 2019-12-06 07:46:50
Question: I am working on datapaths and have been trying to understand branch instructions. This is what I understand so far. In MIPS, every instruction is 32 bits, which is 4 bytes, so the next instruction is four bytes away. As an example, say the PC address is 128. My first issue is understanding what this 128 means. My current belief is that it is an index into memory, so 128 refers to a position 128 bytes into memory. Therefore, the datapath always says to add 4 to the PC. Add 4 bits to
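The PC is indeed a byte address, so PC = 128 points at the 33rd instruction (128 / 4), and "add 4" means 4 bytes, i.e. one instruction. The branch-target arithmetic can be sketched directly; the immediate values below are made up for illustration, not taken from the question:

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """MIPS I-type branch: target = (PC + 4) + (sign_extend(imm) << 2).
    The immediate counts instructions, hence the shift by 2 (x4 bytes)."""
    return (pc + 4) + (sign_extend16(imm16) << 2)

pc = 128                          # byte address => instruction index 128 // 4 == 32
next_pc = pc + 4                  # sequential: one 32-bit instruction later
taken = branch_target(pc, 3)      # skip 3 instructions forward
self_loop = branch_target(pc, 0xFFFF)  # immediate -1 branches back to pc itself
```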

What causes the DTLB_LOAD_MISSES.WALK_* performance events to occur?

徘徊边缘 submitted on 2019-12-06 07:12:17
Consider the following loop:

.loop:
    add rsi, STRIDE
    mov eax, dword [rsi]
    dec ebp
    jg  .loop

where STRIDE is some non-negative integer and rsi contains a pointer to a buffer defined in the bss section. This loop is the only loop in the code; that is, the buffer is not initialized or touched before the loop. On Linux, all of the 4K virtual pages of the buffer will be mapped on demand to the same physical page. I've run this code for all possible strides in the range 0-8192. The measured number of minor and major page faults is exactly 1 and 0, respectively, per page accessed. I've also measured all
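The number of distinct 4 KiB pages such a strided loop touches (and hence the demand faults, and the TLB misses on a cold TLB) can be modelled directly. A sketch, with iteration counts chosen for illustration:

```python
PAGE = 4096  # 4 KiB page size

def pages_touched(stride, iters, base=0):
    """Set of page numbers accessed by: for i in 1..iters: load [base + i*stride],
    mirroring the loop above, which adds STRIDE before each load."""
    return {(base + i * stride) // PAGE for i in range(1, iters + 1)}

pages_8k = len(pages_touched(8192, 100))  # stride 8192: every access on a fresh page
pages_1b = len(pages_touched(1, 4095))    # stride 1: 4095 accesses stay on one page
```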

Detecting Aligned Memory requirement on target CPU

允我心安 submitted on 2019-12-06 06:48:33
Question: I'm currently trying to build code that is supposed to work on a wide range of machines, from handheld devices and sensors to big servers in data centers. One of the (many) differences between these architectures is the requirement for aligned memory access. Aligned memory access is not required on "standard" x86 CPUs, but many other CPUs need it and raise an exception if the rule is not respected. Up to now, I've been dealing with it by forcing the compiler to be cautious on specific data
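The standard portable idiom for the access itself is to copy through `memcpy` rather than dereference a misaligned pointer; compilers turn it into a single load on targets that allow unaligned access. The Python sketch below stands in for that byte-wise C idiom:

```python
import struct

buf = bytes(range(1, 17))  # 16 arbitrary bytes standing in for a raw buffer

def load_u32_unaligned(data, offset):
    """Read a little-endian 32-bit value at any byte offset.
    Equivalent in spirit to the C idiom:
        uint32_t v; memcpy(&v, p + offset, 4);
    which is safe on strict-alignment CPUs and compiles to one
    load where the target supports unaligned access."""
    return struct.unpack_from("<I", data, offset)[0]

v_odd = load_u32_unaligned(buf, 3)  # offset 3 is misaligned for a 4-byte load
```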

Are C++ int operations atomic on the MIPS architecture?

大憨熊 submitted on 2019-12-06 04:38:59
Question: I wonder if I can read or write a shared int value without locking on a MIPS CPU (specifically Amazon or Danube). What I mean is whether such a read or write is atomic (another thread can't interrupt it). To be clear - I don't want to prevent a race between threads, but I do care that the int value itself is not corrupted. Assuming that the compiler aligns all ints on CPU word boundaries, this should be possible. I use gcc (g++). Tests also show that it seems to work correctly. But maybe someone knows it
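The corruption the asker worries about is tearing: observing a mix of two different stores. A naturally aligned 32-bit store on MIPS (`sw`) is a single access and cannot tear, which is the property being asked for; in portable C++ the way to guarantee it (and to get defined behavior under concurrent access) is `std::atomic<int>`. The toy Python model below only illustrates what a torn value would look like if a 32-bit store were split into two 16-bit halves:

```python
def split16(value):
    """Split a 32-bit value into (low, high) 16-bit halves."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF

def combine16(low, high):
    return (high << 16) | low

old, new = 0x11112222, 0x33334444
# A torn result: low half from the new store, high half from the old one.
# A single-copy-atomic 32-bit store can never produce this value.
torn = combine16(split16(new)[0], split16(old)[1])
```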

How to calculate effective CPI for a 3 level cache

时间秒杀一切 submitted on 2019-12-06 03:03:27
Question: I am hopelessly stuck on a homework problem, and I would love some help understanding it better. Here is what I was given:

CPU base CPI = 2, clock rate = 2 GHz
Primary cache miss rate/instruction = 7%
L2 cache access time = 15 ns; L2 local miss rate/instruction = 30%
L3 cache access time = 30 ns; L3 global miss rate/instruction = 3%
Main memory access time = 150 ns

What is the effective CPI? It is my understanding that I need to calculate the miss penalty for each cache level.
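Under one common reading of these numbers (L1 misses per instruction pay the L2 access time, the 30% local L2 misses additionally pay L3, the 3% global misses pay DRAM, and access times convert to cycles at 0.5 ns/cycle), the arithmetic can be sketched as follows. This is a hedged interpretation of the problem statement, not a verified answer key:

```python
CLOCK_NS = 0.5            # 2 GHz clock => 0.5 ns per cycle
BASE_CPI = 2.0

l1_miss = 0.07            # misses per instruction that go to L2
l2_local_miss = 0.30      # fraction of L2 accesses that continue to L3
l3_global_miss = 0.03     # misses per instruction that go to main memory

l2_cycles = 15 / CLOCK_NS    # 30 cycles
l3_cycles = 30 / CLOCK_NS    # 60 cycles
mem_cycles = 150 / CLOCK_NS  # 300 cycles

effective_cpi = (BASE_CPI
                 + l1_miss * l2_cycles                  # every L1 miss pays L2
                 + l1_miss * l2_local_miss * l3_cycles  # L2 misses also pay L3
                 + l3_global_miss * mem_cycles)         # global misses pay DRAM
# 2 + 2.1 + 1.26 + 9 = 14.36 under these assumptions
```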

How do cores decide which cache line to invalidate in MESI?

℡╲_俬逩灬. submitted on 2019-12-06 02:54:30
I have some misunderstanding about cache lines. I'm using Haswell and Ubuntu. Let's say we have a 2-threaded application in which the following happens first:

mov [addr], dword 0xAC763F   ; executed before starting Thread 1 and Thread 2

Now let's say the threads perform the following actions in parallel:

Thread 1                 Thread 2
mov rax, [addr]          mov rax, [addr]
mov [addr], dword 1      mov [addr], dword 2

My understanding of what's going on is this: before starting the threads, the main thread writes to the corresponding cache line (addr) and marks it as Exclusive. If both Thread 1 and Thread 2 finished reading
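MESI's answer to "which line gets invalidated" is: every other core's copy of the written line, via the invalidate (RFO) that a store broadcasts. The toy two-core model below sketches only the transitions this scenario exercises (note that the initial write leaves the line Modified, not Exclusive as the question assumes); real coherence goes through the snoop/L3 machinery and is considerably more involved:

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Line:
    """One cache line's MESI state tracked across two cores."""
    def __init__(self):
        self.state = [I, I]

    def read(self, core):
        other = 1 - core
        if self.state[core] != I:
            return                  # read hit: no state change
        if self.state[other] in (M, E, S):
            self.state[other] = S   # a Modified copy is written back first
            self.state[core] = S
        else:
            self.state[core] = E    # sole copy in the system: Exclusive

    def write(self, core):
        self.state[1 - core] = I    # RFO invalidates the other core's copy
        self.state[core] = M

line = Line()
line.write(0)            # main thread's store: core 0 holds the line Modified
line.read(1)             # the other core reads: both copies become Shared
line.write(0)            # core 0 stores again: core 1's copy is invalidated
after = tuple(line.state)
```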

Why does high-memory not exist for 64-bit cpu?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 02:42:06
While trying to understand the high-memory problem for 32-bit CPUs on Linux, I wonder: why is there no high-memory problem for 64-bit CPUs? In particular, how does the division of virtual memory into kernel space and user space change, so that the requirement for high memory doesn't exist on 64-bit CPUs? Thanks. A 32-bit system can only address 4 GB of memory. In Linux this is divided into 3 GB of user space and 1 GB of kernel space. This 1 GB is sometimes not enough, so the kernel might need to map and unmap areas of memory, which incurs a fairly significant performance penalty. The kernel space is the
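The arithmetic behind this: a 32-bit kernel's 1 GiB window (minus roughly 128 MiB reserved for vmalloc and similar mappings, giving the classic ~896 MiB lowmem limit) cannot permanently map RAM beyond that point, hence "high memory". With x86-64's 48-bit virtual addresses the kernel half is 128 TiB, which dwarfs any installable RAM, so everything can be direct-mapped. A sketch of the numbers:

```python
MiB = 1 << 20
GiB = 1 << 30

virt_32 = 1 << 32                      # 4 GiB of 32-bit virtual address space
kernel_window_32 = 1 * GiB             # classic Linux 3G/1G split
lowmem_32 = kernel_window_32 - 128 * MiB  # ~896 MiB directly mappable ("lowmem")

virt_48 = 1 << 48                      # 256 TiB with 48-bit virtual addresses
kernel_half_64 = virt_48 // 2          # 128 TiB for the kernel: no highmem needed
```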

How would the MONITOR instruction (_mm_monitor intrinsic) be used by a driver?

强颜欢笑 submitted on 2019-12-06 01:54:13
I am exploring the usage of the MONITOR instruction (or the equivalent intrinsic, _mm_monitor). Although I found literature describing it, I could not find any concrete examples/samples of how to use it. Can anyone share an example of how this instruction/intrinsic would be used in a driver? Essentially, I would like to use it to watch memory ranges. The MONITOR instruction arms the address-monitoring hardware using the address specified in RAX/EAX/AX. To quote Intel: the state of the monitor is used by the MWAIT instruction. The effective address size used (16, 32 or 64-bit) depends on the

What is the maximum IPC achievable by the Intel Nehalem microarchitecture?

心不动则不痛 submitted on 2019-12-06 01:26:28
Is there an estimate for the maximum instructions per cycle achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum instructions per cycle? TL;DR: Intel Core, Nehalem, and Sandybridge/IvyBridge: a maximum of 5 IPC, including one macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be micro-fused stores or load+ALU.) Haswell up to 9th gen: a maximum of 6 instructions per cycle can be achieved using two pairs of macro-fusable ALU+branch instructions and two
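The accounting above can be sketched as a simple model: the fused-domain pipeline issues 4 uops per cycle, and each macro-fusion packs 2 instructions into 1 uop. This ignores every other bottleneck (execution ports, decode, ROB, memory), so it is strictly a best-case ceiling:

```python
def max_ipc(issue_width, macro_fusions_per_cycle):
    """Best-case instructions per cycle: each macro-fusion adds one
    extra instruction on top of the fused-domain issue width."""
    return issue_width + macro_fusions_per_cycle

nehalem = max_ipc(4, 1)  # 4 uops/cycle, one fused cmp+jcc pair -> 5 IPC
haswell = max_ipc(4, 2)  # two macro-fused pairs per cycle      -> 6 IPC
```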