cpu-architecture

When does the CPU flush a value in the store buffer to the L1 cache?

爱⌒轻易说出口 submitted on 2019-12-13 03:41:14
Question: Core A writes value x to the store buffer, waits for an invalidate ack, and then flushes x to the cache. Does it wait for only one ack or for all acks? And how does it know how many acks there are across all CPUs?

Answer 1: It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line. In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores since the stores in the

Pipeline Stall Related to BNE Execution and Label Instruction Fetch

為{幸葍}努か submitted on 2019-12-13 03:39:12
Question: Below is the solution to a pipeline question. After reading the solution, I have a question: why is the EX of the first line, bne $7, $0, L1, in the same cycle as the IF of the last line, L1: sw $8, 0($3)? To my understanding, before the instruction fetch for the last line, the pipeline should wait until bne finishes evaluating the condition, so that it knows whether it needs to fetch that instruction or not. Any hint is appreciated. Thanks so much for your time and help.

Answer 1: According to https://en.wikipedia.org/wiki/Classic_RISC

Slowdown when accessing data at page boundaries?

百般思念 submitted on 2019-12-13 00:47:46
Question: (My question is about computer architecture and performance. I did not find a more relevant forum, so I am posting it here as a general question.) I have a C program which accesses memory words that are located X bytes apart in virtual address space, for instance: for (int i = 0; <some stop condition>; i += X) { array[i] = 4; }. I measure the execution time with varying values of X. Interestingly, when X is a power of two and around the page size, e.g., X = 1024, 2048, 4096, 8192..., I get a huge

Size of a memory page: reasoning using the offset

Deadly submitted on 2019-12-12 23:48:32
Question: This is a theoretical consideration, but I think this forum is OK for it. If I am wrong, please ask for it to be moved elsewhere. A virtual address (a8b43e) was mapped to (13fcb43e). What can we say about the size of a page? We should look at the maximal possible size of the offset. Here we can see that the matching suffix is b43e. Moreover, we should look at the binary representations of 8 and c: 8 = 1000, c = 1100. So the last two bits of those digits also match. Altogether, the page size is <= 2^(4*4+2) = 2^18. Is that OK?

Answer 1: Your math

Why is it not possible to read an unaligned word in one step?

六眼飞鱼酱① submitted on 2019-12-12 14:34:22
Question: Given that the word size of a CPU allows it to address every single byte in memory, and given that via PAE a CPU can even use more bits than its word size for addressing, what is the reason that a CPU cannot read an unaligned word in one step? For example, on a 32-bit machine you can read the 4-byte chunk starting at position 0, but you cannot read the one starting at position 1 (you can, but it needs several steps). Why can CPUs not do that?

Answer 1: The problem is not with the ability of the

Read odd addresses, half words?

北慕城南 submitted on 2019-12-12 13:36:15
Question: It's common knowledge that many CPU architectures (ARM, PPC) cannot read odd addresses and will generate an exception if forced to, while others can, but do so slightly more slowly (x86). But is there any CPU which can only address full 32-bit (or even larger!) words, i.e., which cannot address 16-bit halfwords? Perhaps amd64? I am trying to write a portable yet fast C malloc-like allocator and want to align my memory accesses properly. Currently I am targeting ARM, i386 and amd64, and these I could

Indexed addressing mode and implied addressing mode

跟風遠走 submitted on 2019-12-12 13:19:47
Question: Indexed addressing mode is usually used for accessing arrays, as arrays are stored contiguously. We have an index register which gets incremented every iteration and which, when added to the base address, gives the address of the array element. I don't understand the actual need for this addressing mode. Why can't we do this with direct addressing? We have the base address, and we could just add 1 to it every time we access the array. Why do we need an indexed addressing mode, which has the overhead of an index register? I

On which CPU architectures are writes to an int “implicitly volatile” using the CLR (and variants)?

对着背影说爱祢 submitted on 2019-12-12 12:35:10
Question: I recently learnt here that the following is thread-safe on x86 CPUs with the x86 CLR (not necessarily an ECMA-standard CLR): public class SometimesThreadSafe { private int value; public int Value { get { return value; } } public void Update() { Interlocked.Add(ref value, 47); } }. This is because writing to an int on such architectures ensures that any other CPU's cached copy of value is synced. On ARM CPUs, however, this is not thread-safe, since reading value from a different thread could read an old copy from a

Can a processor do memory and arithmetic operations at the same time?

浪子不回头ぞ submitted on 2019-12-12 12:22:49
Question: In studying assemblers and processors, one thing puzzles me: how is the instruction add mem, 1 executed? In my head, the processor cannot load the memory value and perform the arithmetic operation within the same instruction, so I figure it takes place like: mov reg, mem; add reg, 1; mov mem, reg. If I consider a processor with a RISC pipeline, we can observe some stalls. That is surprising for an instruction as simple as i++: | Fetch | Decode | Exec | Memory | WriteB | | Fetch | | | Decode |

What does the STREAM memory bandwidth benchmark really measure?

喜欢而已 submitted on 2019-12-12 12:19:34
Question: I have a few questions on the STREAM benchmark (http://www.cs.virginia.edu/stream/ref.html#runrules). Below is a comment from stream.c. What is the rationale for the requirement that arrays should be 4 times the size of the cache? "(a) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size." I originally assumed STREAM measures the peak memory