cpu-architecture

Is CMOVcc considered a branching instruction?

笑着哭i submitted on 2020-08-20 07:27:40
Question: I have this memchr code that I'm trying to make non-branching:

    .globl memchr
    memchr:
        mov %rdx, %rcx
        mov %sil, %al
        cld
        repne scasb
        lea -1(%rdi), %rax
        test %rcx, %rcx
        cmove %rcx, %rax
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Answer 1: No, it's not a branch; that's the whole point of cmovcc. It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it
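To make the "ALU select with a data dependency on both inputs" idea concrete, here is a minimal C sketch of a branch-free select built from a bit mask. The names `select_uptr` and `match_or_null` are hypothetical and not part of the original question or answer. Whether a compiler actually emits cmov for this code is not guaranteed; the sketch only illustrates that both inputs are always read and that no jump is needed to pick one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-free select: returns a when cond is nonzero, b otherwise.
 * Both a and b are read unconditionally, so the result depends on
 * both inputs as data, which is the same shape as cmovcc. */
static inline uintptr_t select_uptr(int cond, uintptr_t a, uintptr_t b)
{
    /* mask is all ones when cond is nonzero, all zeros otherwise */
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}

/* Hypothetical helper mirroring the test/cmove tail of the memchr above:
 * keep the candidate address unless the remaining count is zero, in which
 * case return a zero address (NULL on mainstream platforms). */
static inline const void *match_or_null(const unsigned char *candidate,
                                        size_t remaining)
{
    assert(candidate != NULL);
    return (const void *)select_uptr(remaining != 0,
                                     (uintptr_t)candidate,
                                     (uintptr_t)0);
}
```

The mask form is a design choice for illustration: it spells out the select as plain ALU operations, whereas a ternary expression leaves the choice between a conditional move and a conditional jump to the compiler.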

Can constant non-invariant tsc change frequency across cpu states?

不问归期 submitted on 2020-08-20 03:45:01
Question: I used to benchmark Linux system calls with rdtsc to get the counter difference before and after the system call. I interpreted the result as wall-clock time, since the TSC increments at a constant rate and does not stop when entering the halt state. The Invariant TSC concept is described as: "The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states." Can a constant non-invariant TSC change frequency when changing state from C0 (operating) to C1 (halted)? My current view is that it
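Before reading rdtsc differences as wall-clock time, it helps to check whether the processor advertises an invariant TSC at all. It is reported in bit 8 of EDX for CPUID leaf 0x80000007. The following is a minimal C sketch under stated assumptions: the helper names `edx_reports_invariant_tsc` and `tsc_delta_to_ns` are hypothetical, the EDX value is passed in by the caller so the code stays portable, and the TSC frequency is assumed to be known from some other source.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf that reports advanced power management information. */
#define CPUID_LEAF_APM 0x80000007u

/* Returns 1 when bit 8 of the EDX value from CPUID leaf 0x80000007 is set,
 * meaning the processor advertises an invariant TSC; returns 0 otherwise. */
static inline int edx_reports_invariant_tsc(uint32_t edx)
{
    return (int)((edx >> 8) & 1u);
}

/* Converts a TSC delta to nanoseconds for a caller-supplied TSC frequency.
 * The result is only meaningful as wall-clock time if the TSC rate stayed
 * constant over the measured interval. Assumes tsc_hz is below roughly
 * 18 GHz so the intermediate product fits in 64 bits. */
static inline uint64_t tsc_delta_to_ns(uint64_t start, uint64_t end,
                                       uint64_t tsc_hz)
{
    assert(tsc_hz != 0);
    uint64_t ticks = end - start; /* unsigned subtraction tolerates wraparound */
    return (ticks / tsc_hz) * 1000000000u
         + (ticks % tsc_hz) * 1000000000u / tsc_hz;
}
```

On GCC or Clang targeting x86, the EDX value could be obtained with `__get_cpuid(CPUID_LEAF_APM, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`. That call is left out of the sketch so the code compiles on any platform.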

What do multiple values or ranges mean as the latency for a single instruction?

只谈情不闲聊 submitted on 2020-08-19 10:55:45
Question: I have a question about instruction latency on https://uops.info/. For some instructions like PCMPEQB(XMM, M128), the latency in the table entry for Skylake is listed as [1;≤8]. I know a little about latency, but as far as I know it's just a single number, for example 1, 2, or 3. What does [1;≤8] mean? Does it mean the latency depends on memory and is between 1 and 8? If so, when is it 1, when is it 3, and so on? For example, what is the latency for this: pcmpeqb xmm0,

Is pipelining/OoOE available on modern x86 processors when running in real mode?

﹥>﹥吖頭↗ submitted on 2020-08-09 09:11:09
Question: When running a boot-loader program on a modern-day x86 processor, the processor will be running in real-address mode. Will its instruction pipelining features be active in real mode, or not?

Answer 1: Yes, the out-of-order core in modern microarchitectures operates basically the same regardless of mode. Most of the difference is in the decoders. See Agner Fog's microarch PDF and other links in the x86 tag wiki for details of how modern CPUs actually work internally. It would probably take

Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

99封情书 submitted on 2020-08-05 04:47:31
Question: I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:

    .text
    test:
        #xorps %xmm0, %xmm0
        cvtsi2ss %edi, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
