cpu-architecture | 易学教程

Is there any way to write for Intel CPU direct core-to-core communication code?

Source: https://stackoverflow.com/questions/58741806/is-there-any-way-to-write-for-intel-cpu-direct-core-to-core-communication-code

Is CMOVcc considered a branching instruction?

Question: I have this memchr code that I'm trying to make non-branching:

        .globl memchr
    memchr:
        mov %rdx, %rcx
        mov %sil, %al
        cld
        repne scasb
        lea -1(%rdi), %rax
        test %rcx, %rcx
        cmove %rcx, %rax
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Answer 1: No, it's not a branch; that's the whole point of cmovcc. It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it
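To illustrate the "ALU select" idea from the answer, here is a minimal C sketch of a select that depends on both inputs as data rather than on a branch. The names `select_uptr` and `find_byte` are hypothetical, chosen only for this example. It assumes a platform where the integer value 0 converts to a null pointer, which holds on mainstream targets. A compiler may turn this into `cmov`, into mask arithmetic, or even into a branch, so treat it as a sketch of the concept, not a guarantee about generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Return a when cond is nonzero, otherwise b, using only ALU operations.
 * Both a and b are always evaluated: a data dependency on both inputs. */
static inline uintptr_t select_uptr(int cond, uintptr_t a, uintptr_t b)
{
    /* mask is all ones when cond is true, all zeros otherwise */
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* memchr-like search (hypothetical helper). The scan loop still branches,
 * but choosing between "pointer to match" and "null" is done by a select. */
const void *find_byte(const void *buf, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = 0;

    while (i < n && p[i] != (unsigned char)c)
        i++;

    /* i < n means a match was found at p + i; otherwise yield 0 (null). */
    return (const void *)select_uptr(i < n, (uintptr_t)(p + i), (uintptr_t)0);
}
```

The trade-off mirrors the one described for `cmovcc`: the result cannot be produced until both candidate values are ready, whereas a correctly predicted branch would only wait for one of them.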

Can constant non-invariant tsc change frequency across cpu states?

Question: I used to benchmark Linux system calls with rdtsc, taking the counter difference before and after the system call. I interpreted the result as wall-clock time, since the TSC increments at a constant rate and does not stop when entering the halt state. The Invariant TSC concept is described as: "The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states." Can a constant non-invariant TSC change frequency when changing state from C0 (operating) to C1 (halted)? My current view is that it
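The measurement described in the question can be sketched roughly as follows. This is an illustration, not the asker's actual benchmark: `read_ticks` and `time_getpid_ticks` are hypothetical names, `getpid` is an arbitrary choice of system call, and the non-x86 fallback to `clock_gettime` is an assumption added only so the sketch compiles elsewhere. On x86 it uses the `__rdtsc()` intrinsic that GCC and Clang provide in `<x86intrin.h>`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for the syscall() declaration on glibc */
#endif
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter with the RDTSC instruction. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for non-x86 targets (assumption): monotonic clock in ns. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
#endif

/* Total counter delta across `iterations` raw getpid system calls. */
uint64_t time_getpid_ticks(int iterations)
{
    uint64_t start = read_ticks();
    for (int i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* raw syscall, avoids any libc-level caching */
    return read_ticks() - start;
}
```

Dividing the result by the iteration count gives ticks per call. Reading those ticks as wall-clock time relies on the counter running at a fixed rate for the whole interval, which is exactly the property the question asks about.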

What do multiple values or ranges means as the latency for a single instruction?

Question: I have a question about instruction latency on https://uops.info/. For some instructions, like PCMPEQB (XMM, M128), the latency in the table entry for Skylake is listed as [1;≤8]. I know a little about latency, but as far as I know it's just a single number, for example 1 or 2 or 3. So what does [1;≤8] mean? Does it mean the latency depends on memory and is between 1 and 8? If so, when is it 1, when is it 3, etc.? For example, what is the latency for this: pcmpeqb xmm0,

Is pipelining/OoOE available on modern x86 processors when running in real mode?

Question: When running a boot-loader program on a modern-day x86 processor, the processor will be running in real-address mode. Will its instruction pipelining features be active in real mode, or not?

Answer 1: Yes, the out-of-order core in modern microarchitectures operates basically the same regardless of mode. Most of the difference is in the decoders. See Agner Fog's microarch pdf and other links in the x86 tag wiki for details of how modern CPUs actually work internally. It would probably take

Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

Question: I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:

        .text
    test:
        #xorps %xmm0, %xmm0
        cvtsi2ss %edi, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
