cpu-architecture

Why wasn't MASKMOVDQU extended to 256-bit and 512-bit stores?

 ̄綄美尐妖づ submitted on 2019-12-10 22:32:15
Question: The MASKMOVDQU instruction [1] is special among x86 store instructions because, in principle, it allows you to store individual bytes into a cache line without first loading the entire cache line all the way to the core so that the written bytes can be merged with the not-overwritten existing bytes. It would seem to work using the same mechanism as an NT store: pushing the cache line down without first doing an RFO. Per the Intel software developer's manual (emphasis mine): The MASKMOVQ instruction can be…
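For reference, the byte-masked store the question describes is exposed in C via the SSE2 intrinsic `_mm_maskmoveu_si128`, which compiles to MASKMOVDQU. A minimal sketch (the function name and the choice of masking the low four bytes are illustrative, not from the question):

```c
#include <emmintrin.h>   /* SSE2: _mm_maskmoveu_si128 compiles to MASKMOVDQU */

/* Store only the bytes of `src` whose corresponding mask byte has its
 * high bit set -- here, the low four bytes. The remaining bytes of
 * `dst` are left untouched, without the core merging them itself. */
void masked_store_low4(char *dst, const char *src)
{
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    /* _mm_set_epi8 takes arguments from byte 15 down to byte 0 */
    __m128i mask = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                                (char)0x80, (char)0x80, (char)0x80, (char)0x80);
    _mm_maskmoveu_si128(data, mask, dst);
}
```

Note that the instruction carries a non-temporal hint, so in real streaming code an `_mm_sfence()` is typically needed before the data is handed to another core.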

superscalar and VLIW

ε祈祈猫儿з submitted on 2019-12-10 17:22:56
Question: I want to ask some questions related to ILP. A superscalar processor is a sort of mixture of the scalar and vector processor. So can I say that the architecture of a vector processor follows the superscalar approach? "Processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that." What does this mean? I have read "A superscalar CPU architecture implements a form of parallelism called instruction level…"

How to clear L1, L2 and L3 caches?

好久不见. submitted on 2019-12-10 15:44:10
Question: I am doing some cache performance measuring and I need to ensure the caches are empty of "useful" data before timing. Assuming an L3 cache is 10MB, would it suffice to create a vector of 10M/4 = 2,500,000 floats, iterate through the whole of this vector, and sum the numbers? Would that empty the whole cache of any data which was in it prior to iterating through the vector? Answer 1: Yes, that should be sufficient for flushing the L3 cache of useful data. I have done similar types of measurements and…

Load half word and load byte in a single cycle datapath

时光毁灭记忆、已成空白 submitted on 2019-12-10 14:41:13
Question: There was a problem asked about implementing a load byte instruction in a single-cycle datapath without having to change the data memory, and the solution was something like the image below (http://img214.imageshack.us/img214/7107/99897101.jpg). This is actually quite a realistic question; most memory systems are entirely word-based, and individual bytes are typically only dealt with inside the processor. When you see a "bus error" on many computers, this often means that the processor tried…
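The hardware trick the answer alludes to (fetch the aligned word, then shift and mask out the addressed byte inside the processor) can be modeled in C. This is a sketch of the general technique with an assumed little-endian byte numbering, not the exact datapath from the linked image:

```c
#include <stdint.h>

/* Model a word-only memory: load the aligned 32-bit word containing
 * `addr`, then shift/mask to extract the addressed byte. Byte addr&3
 * occupies bits (addr&3)*8 .. (addr&3)*8+7, i.e. little-endian order. */
uint8_t load_byte(const uint32_t *mem, uint32_t addr)
{
    uint32_t word  = mem[addr >> 2];       /* aligned word fetch      */
    uint32_t shift = (addr & 3u) * 8u;     /* byte offset within word */
    return (uint8_t)(word >> shift);
}

/* Sign-extending variant, as a MIPS-style LB would do. */
int32_t load_byte_signed(const uint32_t *mem, uint32_t addr)
{
    return (int8_t)load_byte(mem, addr);
}
```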

How do cores decide which cache line to invalidate in MESI?

こ雲淡風輕ζ submitted on 2019-12-10 09:37:51
Question: I have some misunderstanding about cache lines. I'm using Haswell and Ubuntu. Now let's say we have a 2-threaded application in which the following happens: mov [addr], dword 0xAC763F ; starting Thread 1 and Thread 2. Now let's say the threads perform the following actions in parallel: Thread 1 executes mov rax, [addr] then mov [addr], dword 1; Thread 2 executes mov rax, [addr] then mov [addr], dword 2. Now my understanding of what's going on is this: before starting, the main thread writes to the corresponding cache line…
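The scenario in the question can be reproduced in C with two threads racing to store into the same word. Whatever the interleaving, MESI guarantees a single global order for the stores: the final value is exactly one thread's write, never a blend of both. A minimal sketch (the harness around the question's three stores is an assumption):

```c
#include <pthread.h>
#include <stdatomic.h>

static _Atomic int addr_word;   /* stands in for [addr] */

static void *writer1(void *arg) { (void)arg; atomic_store(&addr_word, 1); return NULL; }
static void *writer2(void *arg) { (void)arg; atomic_store(&addr_word, 2); return NULL; }

/* Main thread writes the initial value, then two threads race to
 * overwrite it; cache coherence ensures the result is 1 or 2. */
int race_once(void)
{
    atomic_store(&addr_word, 0xAC763F);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&addr_word);
}
```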

Would buffering cache changes prevent Meltdown?

微笑、不失礼 submitted on 2019-12-10 07:44:37
Question: If new CPUs had a cache buffer which was only committed to the actual CPU cache if the instructions are ever committed, would attacks similar to Meltdown still be possible? The proposal is to make speculative execution able to load from memory, but not write to the CPU caches until the loads are actually committed. Answer 1: TL:DR: yes, I think it would solve Spectre (and Meltdown) in their current form (using a flush+read cache-timing side channel to copy the secret data from a physical register),…

What is the purpose of the reserved/undefined bit in the flag register?

為{幸葍}努か submitted on 2019-12-10 06:19:26
Question: In the flag register of the Z80, 8080, 8085, and 8086 processors, what is the purpose of bits 1, 3, and 5, which are documented as "reserved" or "undefined"? Answer 1: These bits are unused; that is, no instruction explicitly sets them to any value. The designers decided that five or six flags were enough, and they simply left the remaining bits of the flags register unused. They are documented as "undefined" because it is not possible to know in advance which value they will have after any of the instructions…
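A practical consequence of the answer: an emulator or test harness comparing flag-register snapshots should mask out bits 1, 3, and 5 first. A small sketch (the helper names are illustrative):

```c
#include <stdint.h>

/* Bits 1, 3 and 5 of the 8080/Z80-family flag register are undefined,
 * so they are masked away before any comparison. */
#define FLAG_UNDEFINED_MASK ((uint8_t)((1u << 1) | (1u << 3) | (1u << 5)))  /* 0x2A */

uint8_t canonical_flags(uint8_t f)
{
    return (uint8_t)(f & (uint8_t)~FLAG_UNDEFINED_MASK);  /* defined bits only */
}

int flags_equal(uint8_t a, uint8_t b)
{
    return canonical_flags(a) == canonical_flags(b);
}
```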

CPU cache: does the distance between two addresses need to be smaller than 8 bytes to have a cache advantage?

别说谁变了你拦得住时间么 submitted on 2019-12-10 05:34:56
Question: It may seem a weird question. Say a cache line's size is 64 bytes. Further, assume that L1, L2, and L3 have the same cache line size (this post said that is the case for the Intel Core i7). There are two objects A, B in memory, whose (physical) addresses are N bytes apart. For simplicity, let's assume A is on a cache-line boundary, that is, its address is an integer multiple of 64. 1) If N < 64, when A is fetched by the CPU, B will be read into the cache, too. So if B is needed, and the cache line is not…
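The condition in case 1) amounts to the two addresses agreeing in all bits above the 6-bit line offset. A small sketch of that check, assuming the question's 64-byte line size:

```c
#include <stdint.h>

#define CACHE_LINE 64u   /* line size assumed per the question */

/* Two addresses share a cache line iff they fall into the same
 * 64-byte-aligned block, i.e. their addresses divided by 64 match. */
int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

With A line-aligned, any B at distance N < 64 satisfies this and rides along on A's fetch; with A not aligned, N < 64 alone is not enough.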

Does a branch misprediction flush the entire pipeline, even for very short if-statement body?

不问归期 submitted on 2019-12-10 03:26:23
Question: Everything I've read seems to indicate that a branch misprediction always results in the entire pipeline being flushed, which means a lot of wasted cycles. I never hear anyone mention any exceptions for short if-conditions. This seems like it would be really wasteful in some cases. For example, suppose you have a lone if-statement with a very simple body that is compiled down to 1 CPU instruction. The if-clause would be compiled into a conditional jump forward by one instruction. If the CPU…
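One thing worth noting alongside the question: for exactly this kind of one-instruction body, compilers frequently avoid the branch altogether by emitting a conditional move (CMOV) or arithmetic selection, so there is nothing to mispredict. A sketch of the two forms (the clamp operation is an illustrative example, not from the question):

```c
#include <stdint.h>

/* Short forward branch over a single instruction -- the pattern the
 * question describes. Compilers often turn this into CMOV. */
int32_t clamp_negative_branchy(int32_t x)
{
    if (x < 0)
        x = 0;
    return x;
}

/* Explicitly branchless equivalent using an arithmetic mask.
 * (x >> 31 on a negative value is implementation-defined in C, but is
 * an arithmetic shift on all mainstream two's-complement compilers.) */
int32_t clamp_negative_branchless(int32_t x)
{
    return x & ~(x >> 31);   /* mask is 0 when x < 0, all-ones otherwise */
}
```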

What does “store-buffer forwarding” mean in the Intel developer's manual?

家住魔仙堡 submitted on 2019-12-09 14:50:20
Question: The Intel 64 and IA-32 Architectures Software Developer's Manual says the following about re-ordering of actions by a single processor (Section 8.2.2, "Memory Ordering in P6 and More Recent Processor Families"): "Reads may be reordered with older writes to different locations but not with older writes to the same location." Then, below, when discussing points where this is relaxed compared to earlier processors, it says: "Store-buffer forwarding, when a read passes a write to the same memory…"
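The two rules quoted from the manual can be illustrated with the classic store-buffer litmus test. A sketch, assuming C11 atomics and POSIX threads: each thread's load of the *other* location may be satisfied before its own store drains from the store buffer (so r1 == r2 == 0 is a permitted outcome on x86), whereas a load of the *same* location always sees the thread's own buffered store forwarded to it.

```c
#include <stdatomic.h>
#include <pthread.h>

static _Atomic int x, y;
static int r1, r2;

static void *t1(void *a) { (void)a;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* may pass the store to x */
    return NULL;
}
static void *t2(void *a) { (void)a;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* may pass the store to y */
    return NULL;
}

/* Run one iteration of the store-buffer litmus; r1 == r2 == 0 is allowed. */
void run_sb_once(int *out1, int *out2)
{
    atomic_store(&x, 0); atomic_store(&y, 0);
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL); pthread_join(b, NULL);
    *out1 = r1; *out2 = r2;
}

/* Same-location rule: a read never passes an older write to the same
 * address -- the buffered value is forwarded, so this always returns 7. */
int read_own_store(void)
{
    atomic_store_explicit(&x, 7, memory_order_relaxed);
    return atomic_load_explicit(&x, memory_order_relaxed);
}
```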