cpu-architecture

How can a program's size increase the rate of cache misses?

怎甘沉沦 Submitted on 2019-12-02 09:21:31
Question: Wikipedia has the following statement in its Loop Unrolling article: "Increased program code size, which can be undesirable, particularly for embedded applications. Can also cause an increase in instruction cache misses, which may adversely affect performance." Why is this? Also, won't a program whose code size is larger due to large amounts of dead code leave the cache-miss rate unchanged, since dead code is never executed? Answer 1: Code is typically read into caches in whole cache lines, which

How many and what size cycles will be needed to perform a longword transfer to the CPU

吃可爱长大的小学妹 Submitted on 2019-12-02 08:32:53
The task is for the ColdFire MCF5271 processor: I don't understand how many bus cycles, and of what size, are needed to perform a longword transfer to the CPU, or word transfers. I'm reading the chart and I don't see what the connection is. Any comments are appreciated. I've attached two examples with the answers. DATA BUS SIZE The MCF5271 manual discusses the external interface of the processor in Chapter 17. The processor implements a byte-addressable address space with a 32-bit external data bus. The D[31:0] signals represent the data bus, the A[23:0] signals represent the address

How a jump instruction is executed based on the value of Out, the ALU output

余生长醉 Submitted on 2019-12-02 06:48:54
Question: Figure from The Elements of Computing Systems (Nand2Tetris). Have a look at the scenario where j1 = 1 (out < 0), j2 = 0 (out = 0), j3 = 1 (out > 0). How is this scenario possible, when out < 0 and out > 0 are both selected but out = 0 is not? How can out have both positive and negative values at the same time? In other words, when is the JNE instruction going to execute

Virtually addressed Cache

孤人 Submitted on 2019-12-02 05:06:30
Relation between cache size and page size: how do the associativity and page size constrain the cache size in a virtually indexed cache architecture? In particular I am looking for an example of the following statement: if C ≤ (page_size × associativity), the cache index bits come only from the page offset (the same in the virtual address and the physical address). Peter Cordes: Intel CPUs have used an 8-way associative 32 KiB L1D with 64 B lines for many years, for exactly this reason. Pages are 4 KiB, so the page offset is 12 bits, exactly the same number of bits that make up the index and offset within a cache line.

How a jump instruction is executed based on the value of Out, the ALU output

本秂侑毒 Submitted on 2019-12-02 04:31:46
Figure from The Elements of Computing Systems (Nand2Tetris). Have a look at the scenario where j1 = 1 (out < 0), j2 = 0 (out = 0), j3 = 1 (out > 0). How is this scenario possible, when out < 0 and out > 0 are both selected but out = 0 is not? How can out have both positive and negative values at the same time? In other words, when is the JNE instruction going to execute? It seems theoretically possible to me, but practically it is not. Answer: If out < 0, the jump is executed if j1 = 1. If out = 0, the jump is executed if j2 = 1. If out > 0, the jump is executed if j3 = 1. Hopefully now you can understand

Why is it better to use the ebp than the esp register to locate parameters on the stack?

我只是一个虾纸丫 Submitted on 2019-12-02 04:22:38
I am new to MASM. I am confused about these pointer registers. I would really appreciate it if you could help me. Thanks. Encoding an addressing mode using [ebp + disp8] is one byte shorter than [esp + disp8], because using ESP as a base register requires a SIB byte. See "rbp not allowed as SIB base?" for details. (That question title is asking about the fact that [ebp] has to be encoded as [ebp+0].) The first use of [esp + disp8] after a push or pop, or after a call, requires a stack-sync uop on Intel CPUs. (What is the stack engine in the Sandybridge microarchitecture?). Of

Regarding instruction ordering in executions of cache-miss loads before cache-hit stores on x86

ぐ巨炮叔叔 Submitted on 2019-12-02 02:03:00
Question: Given the small program shown below (handcrafted to look the same from a sequential consistency / TSO perspective), and assuming it is being run by a superscalar out-of-order x86 CPU: Load A <-- A is in main memory Load B <-- B is in L2 Store C, 123 <-- C is in L1 I have a few questions: Assuming a big enough instruction window, will the three instructions be fetched, decoded, and executed at the same time? I assume not, as that would break execution in program order. The 2nd load is going to take

When could 2 virtual addresses map to the same physical address?

会有一股神秘感。 Submitted on 2019-12-02 00:57:23
Question: An operating system / computer architecture question here. I was reading about caches, and how virtually indexing the cache is an option to reduce address-translation time. I came across the following: "Virtual cache difficulties include aliasing: two different virtual addresses may have the same physical address." I can't think of a scenario in which this can occur. It's been a while since my O/S days and I'm drawing a blank. Could someone provide an example? Thanks. Answer 1: Two processes might have

Is it possible to perform some computations within the RAM?

自古美人都是妖i Submitted on 2019-12-02 00:47:15
Question: Theoretically, is there any way to perform computations within the RAM, using memory-related instructions such as mov, clflush, or whatever, such as an XOR between two adjacent rows for example? With my limited knowledge of RAM and assembly, I can't think of any such possibility. Answer 1: No, any computation is done in the CPU (or GPU, or other system devices that can load/store to RAM). Even the Turing-complete mov stuff that @PaulR linked in a comment is just using the CPU's address

Which is optimal: a bigger cache block size or a smaller one?

一笑奈何 Submitted on 2019-12-02 00:40:29
Given a cache with constant capacity and associativity, for code that computes the average of array elements, would a cache with a higher block size be preferred? [from comments] Examine the code given below to compute the average of an array:

    total = 0;
    for (j = 0; j < k; j++) {
        sub_total = 0; /* nested loops to avoid overflow */
        for (i = 0; i < N; i++) {
            sub_total += A[j*N + i];
        }
        total += sub_total / N;
    }
    average = total / k;

Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set