cpu-architecture

Advantages of a 64-bit system

Question: From a developer's perspective, I am trying to understand: what is the selling point of a 64-bit system? I understand that more registers are at your disposal and more memory can be allocated to a process, but I cannot understand what makes a developer's life easier. Any examples? From a performance perspective, are there any gains when a program is run on 32-bit vs. 64-bit? Cheers! EDIT: Thank you for all your replies. I see some conversations shooting towards end-user experience,
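
To make the register and wide-arithmetic points concrete, here is a minimal C sketch (my own illustration, not from the original thread): on a 32-bit x86 target the 64-bit add below compiles to an add/adc pair and the multiply to a multi-instruction sequence, while on x86-64 each is a single instruction, and the extra general-purpose registers (r8-r15) reduce spills in larger functions.

    #include <stdint.h>

    /* Hypothetical hash-style helper; the constant is an arbitrary
     * 64-bit odd value. On i386 each operation needs several
     * instructions; on x86-64 it is one add and one imul. */
    uint64_t mix64(uint64_t a, uint64_t b)
    {
        return (a + b) * 0x9e3779b97f4a7c15ULL;
    }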

Peak FLOPs per cycle for ARM11 and Cortex-A7 cores in Raspberry Pi 1 and 2

I would like to know the peak FLOPs per cycle for the ARM1176JZF-S core in the Raspberry Pi 1 and the Cortex-A7 cores in the Raspberry Pi 2. From the ARM1176JZF-S Technical Reference Manual, it seems that VFPv2 can do one SP MAC every clock cycle and one DP MAC every other clock cycle. In addition, there are three pipelines which can operate in parallel: a MAC pipeline (FMAC), a division and sqrt pipeline (DS), and a load/store pipeline (LS). Based on this, it appears the ARM1176JZF-S of the Raspberry Pi 1 can do at least (from the FMAC pipeline) 1 DP FLOP/cycle: one MAC/2 cycles 2 SP FLOPs
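
Taking that reading of the TRM at face value, the peak figures fall out of a one-line calculation. The sketch below is back-of-the-envelope; the 700 MHz figure is the Pi 1's stock clock, not something stated in the excerpt.

    /* peak GFLOPS = clock (GHz) * FLOPs issued per cycle
     * ARM1176JZF-S (VFPv2), per the TRM reading above:
     *   SP: 1 MAC/cycle   -> 2 FLOPs/cycle -> 0.7 * 2 = 1.4 SP GFLOPS
     *   DP: 1 MAC/2 cycles -> 1 FLOP/cycle -> 0.7 * 1 = 0.7 DP GFLOPS */
    double peak_gflops(double clock_ghz, double flops_per_cycle)
    {
        return clock_ghz * flops_per_cycle;
    }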

Why isn't there a data bus which is as wide as the cache line size?

Question: When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy (typically 64 bytes on x86_64). This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes). EDIT: "Data bus" means the bus between the CPU die and the DRAM modules in this context. This data bus width does not necessarily correlate with the word size. Depending on the strategy, the actually requested address gets fetched first, and then
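
The arithmetic behind the question is worth spelling out: a 64-byte line crosses an 8-byte channel in a burst of eight transfers, which matches DDR3/DDR4's native burst length. A tiny sketch of my own, using round numbers:

    /* 64-byte cache line / 8-byte bus = 8 beats per line.
     * DDR3/DDR4 burst length is 8, so one burst moves one line;
     * critical-word-first delivery can return the requested 8 bytes
     * in the first beat while the rest streams in behind it. */
    enum { LINE_BYTES = 64, BUS_BYTES = 8, BEATS = LINE_BYTES / BUS_BYTES };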

How do Intel CPUs that use the ring bus topology decode and handle port I/O operations

I understand port I/O at the hardware-abstraction level (i.e., the CPU asserts a pin that indicates to devices on the bus that the address is a port address, which makes sense on earlier CPUs with a simple address-bus model), but I'm not really sure how it's implemented microarchitecturally on modern CPUs, and in particular how a port I/O operation appears on the ring bus. First: where does the IN/OUT instruction get allocated to, the reservation station or the load/store buffer? My initial thought was that it would be allocated in the load/store buffer and the memory scheduler recognises it,
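
For readers who have not issued port I/O from software, here is a minimal user-space sketch for Linux/x86 (assumes glibc's sys/io.h and root privileges; port 0x80 is the traditional POST/debug port and is usually harmless to write):

    #include <sys/io.h>   /* ioperm(), outb() -- glibc, x86 Linux only */

    int main(void)
    {
        if (ioperm(0x80, 1, 1) != 0)   /* request access to port 0x80 */
            return 1;
        outb(0xAB, 0x80);              /* executes an OUT instruction */
        return 0;
    }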

What is the “EU” in x86 architecture? (calculates effective address?)

I read somewhere that effective addresses (as in the LEA instruction) in x86 instructions are calculated by the "EU." What is the EU? What exactly is involved in calculating an effective address? I've only learned the MC68k instruction set (UC Boulder teaches this first) and I can't find a good x86 web page by searching. "EU" is the generic term for Execution Unit. The ALU is one example of an execution unit. FADD and FMUL, i.e. the floating-point adder and multiplier, are other examples, as, for that matter, is the memory unit, for loads and stores. The EUs relevant to LEA
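
A quick illustration of what that address-generation hardware computes: LEA evaluates base + index*scale + displacement in one step, and compilers use it for ordinary arithmetic as well as addressing. A sketch of mine (the codegen in the comment is typical, though exact output is compiler-dependent):

    /* a + 4*b + 8 fits LEA's base + index*scale + displacement form;
     * gcc/clang typically emit: leal 8(%rdi,%rsi,4), %eax */
    int scale_add(int a, int b)
    {
        return a + 4 * b + 8;
    }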

What is the general difference between superscalar and OoO execution?

Question: I've been reading some material on superscalar and OoO execution and I am confused: their architecture diagrams look very much the same. Answer 1: Superscalar microprocessors can execute two or more instructions at the same time. E.g., typically they have at least 2 ALUs (although a superscalar processor might have 1 ALU and some other execution unit, like a shifter or jump unit). (More precisely, superscalar processors can start executing two or more instructions in the same cycle. Pipelined processors
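
One way to see the difference in code (my own sketch): an in-order superscalar core can issue the two independent adds below together, but if the load misses the cache it stalls at the multiply; an out-of-order core keeps executing the independent adds underneath the miss.

    long demo(long a, long b, const long *p)
    {
        long z = *p * a;   /* long-latency load+multiply; an in-order  */
        long x = a + 1;    /* superscalar stalls here on a cache miss, */
        long y = b + 2;    /* while an OoO core runs these two adds    */
        return x + y + z;  /* underneath the miss                      */
    }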

Is a memory barrier an instruction that the CPU executes, or is it just a marker?

Question: I am trying to understand what a memory barrier is, exactly. Based on what I know so far, a memory barrier (for example, mfence) is used to prevent the reordering of instructions from before the barrier to after it, and from after the barrier to before it. This is an example of a memory barrier in use:

    instruction 1
    instruction 2
    instruction 3
    mfence
    instruction 4
    instruction 5
    instruction 6

Now my question is: is the mfence instruction just a marker telling the CPU in what order to execute the
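
In portable code the same barrier is usually written with C11 atomics rather than raw mfence; on x86-64 a seq_cst fence is typically compiled to mfence (or a lock-prefixed instruction). A minimal publish sketch, with names of my own choosing:

    #include <stdatomic.h>

    int data;             /* plain payload            */
    _Atomic int ready;    /* flag observed by readers */

    void publish(int v)
    {
        data = v;
        atomic_thread_fence(memory_order_seq_cst);  /* ~ mfence on x86 */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }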

Is CPU access to a network card asymmetric?

When we have 2 CPUs on a machine, do they have symmetric access to network cards (PCI)? Essentially, for packet-processing code handling 14M packets per second from a network card, does it matter which CPU it runs on? Not sure if you still need an answer, but I will post one anyway in case someone else might need it. And I assume you are asking about hardware topology rather than OS IRQ-affinity problems. The comment from Jerry is not 100% correct: while NUMA is SMP, access to memory and PCIe resources from different NUMA nodes is not symmetric. It's symmetric as opposed to the
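
A practical consequence: pin the packet-processing threads to the NUMA node the NIC is attached to. Linux exposes that node in sysfs; the sketch below assumes an interface named eth0 (a placeholder):

    #include <stdio.h>

    int main(void)
    {
        /* -1 means the kernel has no NUMA information for the device */
        int node = -1;
        FILE *f = fopen("/sys/class/net/eth0/device/numa_node", "r");
        if (f) {
            fscanf(f, "%d", &node);
            fclose(f);
        }
        printf("NIC is on NUMA node %d\n", node);
        return 0;
    }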

Where is the Write-Combining Buffer located? x86

How is the Write-Combining buffer physically hooked up? I have seen block diagrams illustrating a number of variants:

- between L1 and the memory controller
- between the CPU's store buffer and the memory controller
- between the CPU's AGUs and/or store units

Is it microarchitecture-dependent? Write buffers can have different purposes or different uses in different processors, and this answer may not apply to processors not specifically mentioned. I'd like to emphasize that the term "write buffer" may mean different things in different contexts. This answer is about Intel and AMD processors only. Write-Combining Buffers
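
One place the WC buffers are directly visible to software is non-temporal stores: they bypass the normal cache-fill path, collect in the write-combining buffers, and an sfence orders/flushes them. A sketch of mine (assumes dst is 16-byte aligned and n is a multiple of 16):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void fill_zero_nt(char *dst, long n)
    {
        __m128i zero = _mm_setzero_si128();
        for (long i = 0; i < n; i += 16)
            _mm_stream_si128((__m128i *)(dst + i), zero);  /* movntdq */
        _mm_sfence();   /* make the combined writes globally visible */
    }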

If I don't use fences, how long could it take a core to see another core's writes?

I have been trying to Google my question, but I honestly don't know how to state it succinctly. Suppose I have two threads in a multi-core Intel system, running on the same NUMA node. Suppose thread 1 writes to X once, then only reads it occasionally moving forward. Suppose further that, among other things, thread 2 reads X continuously. If I don't use a memory fence, how long could it be between thread 1 writing X and thread 2 seeing the updated value? I understand that the write of X will go to the store buffer and from there to the cache, at which point MESIF will
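
A runnable version of the scenario, with relaxed atomics and no fence (my own sketch; compile with -pthread). The store drains from the store buffer on its own, typically within tens of nanoseconds; a fence constrains ordering, it does not make the write visible sooner.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    _Atomic int X;

    void *writer(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&X, 42, memory_order_relaxed);  /* no fence */
        return NULL;
    }

    void *reader(void *arg)
    {
        (void)arg;
        while (atomic_load_explicit(&X, memory_order_relaxed) != 42)
            ;   /* spins until coherence propagates the store */
        return NULL;
    }

    int main(void)
    {
        pthread_t r, w;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        puts("reader observed the write");
        return 0;
    }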