cpu-architecture

Are there any modern CPUs where a cached byte store is actually slower than a word store?

左心房为你撑大大i · Submitted on 2019-11-26 17:52:07
Question: It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency compared with storing a full register. But I've never seen any examples. No x86 CPUs are like this, and I think all high-performance CPUs can directly modify any byte in a cache line, too. Are some microcontrollers or low-end CPUs different, if they have a cache at all? (I'm not counting word-addressable machines, or Alpha, which is byte-addressable but lacks byte…

How can I determine for which platform an executable is compiled?

好久不见. · Submitted on 2019-11-26 17:28:06
I need to work with Windows executables built for x86, x64, and IA64. I'd like to programmatically figure out the platform by examining the files themselves. My target language is PowerShell, but a C# example will do. Failing either of those, if you know the logic required, that would be great. (From another question, since removed) Machine type: This is a quick little bit of code I based on some that gets the linker timestamp. It is in the same header, and it seems to work: it returns I386 when compiled for "Any CPU", and x64 when compiled with that as the target platform. The Exploring…

How does memory reordering help processors and compilers?

两盒软妹~` · Submitted on 2019-11-26 17:04:18
Question: I studied the Java memory model and saw reordering problems. A simple example: boolean first = false; boolean second = false; void setValues() { first = true; second = true; } void checkValues() { while(!second); assert first; } Reordering is very unpredictable and weird. Also, it ruins abstractions. I suppose that processor architectures must have a good reason to do something so inconvenient for programmers. What are those reasons? There is a lot of information about how to handle…

How are x86 uops scheduled, exactly?

此生再无相见时 · Submitted on 2019-11-26 16:32:35
Modern x86 CPUs break the incoming instruction stream down into micro-operations (uops) and then schedule those uops out of order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions. For example, take the following toy loop: top: lea eax, [ecx + 5] popcnt eax, eax add edi, eax dec ecx jnz top This basically implements the loop (with the correspondence eax -> total, ecx -> c): do { total += popcnt(c + 5); } while (--c > 0); I'm familiar with the…

Why does IA32 not allow memory-to-memory mov?

一世执手 · Submitted on 2019-11-26 16:17:18
Question: In the Intel IA-32 architecture, instructions like movl and movw do not allow both operands to be memory locations. For example, the instruction movl (%eax), (%edx) is not permitted. Why? Answer 1: The answer involves a fuller understanding of RAM. Simply stated, RAM can only be in one of two states, read mode or write mode. If you wish to copy one byte in RAM to another location, you must have temporary storage outside of RAM as you switch from read to write. It is certainly possible for the…

How much memory can be accessed by a 32-bit machine?

馋奶兔 · Submitted on 2019-11-26 15:16:31
Question: What is meant by a 32-bit or 64-bit machine? It's the processor architecture: a 32-bit machine can read and write 32-bit data at a time, and likewise for a 64-bit machine. What's the maximum memory a 32-bit machine can access? It is 2^32 = 4Gb (4 Gigabit = 0.5 GigaByte). Does that mean 4Gb of RAM? If I reason the same way for a 64-bit machine, I could have 16 exbibytes of RAM; is that possible? Are my concepts right? Answer 1: Yes, a 32-bit architecture is limited to addressing a maximum of 4 gigabytes of…

atomic operation cost

扶醉桌前 · Submitted on 2019-11-26 15:06:14
Question: What is the cost of an atomic operation (any of compare-and-swap or atomic add/decrement)? How many cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses? Will it flush the reorder buffer in an out-of-order CPU? What effects will it have on the cache? I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium. Answer 1: I have looked for actual data for the past days and found nothing. However, I did some research, which compares the cost…

What is a retpoline and how does it work?

冷暖自知 · Submitted on 2019-11-26 14:58:12
Question: In order to mitigate kernel or cross-process memory disclosure (the Spectre attack), the Linux kernel[1] will be compiled with a new option, -mindirect-branch=thunk-extern, introduced to gcc to perform indirect calls through a so-called retpoline. This appears to be a newly invented term, as a Google search turns up only very recent use (generally all in 2018). What is a retpoline, and how does it prevent the recent kernel information-disclosure attacks? [1] It's not Linux-specific, however…
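For reference, the thunk that -mindirect-branch=thunk-extern substitutes for an indirect `jmp *%r11` follows this widely documented pattern (AT&T syntax; the label names here are illustrative, not taken verbatim from any one compiler version):

```asm
__x86_indirect_thunk_r11:
        call    .Lset_up_target   # pushes .Lcapture_spec as the return address
.Lcapture_spec:                   # speculative execution of the ret lands here
        pause                     # and spins harmlessly in this loop instead
        lfence                    # of running attacker-chosen gadgets
        jmp     .Lcapture_spec
.Lset_up_target:
        mov     %r11, (%rsp)      # overwrite the pushed return address with
        ret                       # the real target, then "return" into it
```

The trick is that the CPU's return predictor speculates into the captured pause/lfence loop, while the architectural ret always goes to the real target written over the return address.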

Size of store buffers on Intel hardware? What exactly is a store buffer?

 ̄綄美尐妖づ · Submitted on 2019-11-26 14:34:14
Question: The Intel optimization manual talks about the number of store buffers that exist in many parts of the processor, but does not seem to talk about their size. Is this public information, or is the size of a store buffer kept as a microarchitectural detail? The processors I am looking into are primarily Broadwell and Skylake, but information about others would be nice as well. Also, what do store buffers do, exactly? Answer 1: Related: what is a store buffer? The store buffer as a…

Why is division more expensive than multiplication?

浪尽此生 · Submitted on 2019-11-26 13:49:10
Question: I am not really trying to optimize anything, but I remember hearing this from programmers all the time, so I took it as truth; after all, they are supposed to know this stuff. But I wonder: why is division actually slower than multiplication? Isn't division just glorified subtraction, and multiplication glorified addition? Mathematically, I don't see why going one way or the other should have such different computational costs. Can anyone please clarify the reason/cause of this so I know…