micro-optimization

Which is the best way, in C, to see if a number is divisible by another?

◇◆丶佛笑我妖孽 submitted on 2019-12-11 03:59:22
Question: Which is the best way, in C, to see if a number is divisible by another? I use this:

```c
if (!(a % x)) {
    // this will be executed if a is divisible by x
}
```

Is there any way which is faster? I know that doing, e.g., 130 % 13 will result in doing 130 / 13 ten times. So there are 10 cycles when just one is needed (I just want to know if 130 is divisible by 13). Thanks!

Answer 1: "I know that doing, e.g., 130 % 13 will result in doing 130 / 13 ten times" — Balderdash. % does no such thing on any…
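For a divisor known at compile time, compilers already turn `a % 13 == 0` into a multiply by the modular inverse plus a compare, with no division at all. A minimal sketch of that trick (my addition, not from the answer; the constant 0xC4EC4EC5 is the inverse of 13 modulo 2^32):

```c
#include <stdint.h>
#include <stdio.h>

/* n is divisible by 13 iff n * inv13 (mod 2^32) lands in [0, floor(UINT32_MAX/13)].
 * 0xC4EC4EC5 * 13 == 1 (mod 2^32). */
static int divisible_by_13(uint32_t n)
{
    return n * 0xC4EC4EC5u <= 0xFFFFFFFFu / 13u;
}

int main(void)
{
    printf("%d %d\n", divisible_by_13(130), divisible_by_13(131)); /* prints: 1 0 */
    return 0;
}
```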

AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?

浪子不回头ぞ submitted on 2019-12-10 23:09:46
Question: This is my code for a 'strlen' function in AVX512BW:

```nasm
vxorps   zmm0, zmm0, zmm0   ; ZMM0 = 0
vpcmpeqb k0, zmm0, [ebx]    ; ebx is the string, aligned at a 64-byte boundary
kortestq k0, k0             ; 0x00 found?
jnz      .chk_0x00
```

Now for 'chk_0x00', on x86_64 systems there is no problem and we can handle it like this:

```nasm
.chk_0x00:
    kmovq rbx, k0
    tzcnt rbx, rbx
    add   rax, rbx
```

Here we have a 64-bit register, so we can store the mask into it, but my question is about x86 systems, where we don't have any 64-bit register, so we…
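On 32-bit x86 the 64-bit k-mask can be moved out as two 32-bit halves (e.g. `kmovd` the low half, `kshiftrq` by 32 and `kmovd` the high half) and scanned with 32-bit tzcnt. A minimal C sketch of that idea (my assumption of one workable approach, not code from the post; `__builtin_ctz` is the GCC/Clang tzcnt builtin):

```c
#include <stdint.h>

/* Locate the first set bit of a 64-bit mask supplied as two 32-bit halves. */
static unsigned tzcnt64_via_32(uint32_t lo, uint32_t hi)
{
    if (lo != 0)
        return __builtin_ctz(lo);        /* bit found in the low half  */
    if (hi != 0)
        return 32 + __builtin_ctz(hi);   /* bit found in the high half */
    return 64;                           /* no set bit at all          */
}
```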

Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?

陌路散爱 submitted on 2019-12-10 19:28:13
Question:

```nasm
section .text
%define n 100000
_start:
    xor rcx, rcx
    jmp .cond
.begin:
    movnti [array], eax
.cond:
    add rcx, 1
    cmp rcx, n
    jl  .begin

section .data
array times 81920 db "A"
```

According to perf it runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, it has to be stored in memory (RAM), so it should be slow. P.S. Is there any loop-carried dependency?

EDIT:

```nasm
section .text
%define n 100000
_start:
    xor rcx, rcx
    jmp .cond
.begin:
    movnti [array+rcx], eax
.cond:
    add rcx, 1
    cmp
```
…
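The C intrinsic form of the same loop is easier to experiment with. A minimal sketch (my reconstruction, using SSE2's `_mm_stream_si32`, which compiles to `movnti`): repeated NT stores to one address keep hitting the same open write-combining buffer, so the loop never has to wait for DRAM.

```c
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2) -> movnti */

static char array[81920];

/* Same loop as above: n non-temporal stores, all to array[0]. */
void nt_store_same_address(int value, long n)
{
    for (long i = 0; i < n; i++)
        _mm_stream_si32((int *)array, value);
}
```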

Are scaled-index addressing modes a good idea?

落爺英雄遲暮 submitted on 2019-12-10 19:27:11
Question: Consider the following code:

```c
void foo(int* __restrict__ a) {
    int i;
    int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}
```

This compiles (with maximum optimization but no unrolling or vectorization) into...

GCC 7.2:

```nasm
foo(int*):
    xor eax, eax
.L2:
    mov DWORD PTR [rdi], eax
    add eax, 2
    add rdi, 4
    cmp eax, 200
    jne .L2
    rep ret
```

clang 5.0:

```nasm
foo(int*):                 # @foo(int*)
    xor eax, eax
.LBB0_1:                   # =>This Inner Loop Header: Depth=1
    mov dword ptr [rdi + 2*rax], eax
    add rax, 2
    cmp rax, 200
    jne .LBB0_1
```
…
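The two outputs show the trade-off: GCC spends an extra `add rdi, 4` per iteration to keep a simple `[rdi]` addressing mode, while clang folds the induction variable into a scaled-index store and saves an instruction. A C sketch of the two shapes (illustrative only, not from the question):

```c
#include <stddef.h>

/* GCC's shape: bump a pointer, store through a plain [reg] address. */
void foo_pointer_bump(int *a) {
    int val = 0;
    for (int *p = a, *end = a + 100; p != end; ++p, val += 2)
        *p = val;                        /* mov [rdi], eax; add rdi, 4 */
}

/* clang's shape: one register is both the stored value and, scaled by 2,
 * the byte offset -- a scaled-index addressing mode on the store. */
void foo_scaled_index(int *a) {
    for (size_t i = 0; i < 200; i += 2)
        *(int *)((char *)a + 2 * i) = (int)i;  /* mov [rdi + 2*rax], eax */
}
```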

Do any current C++ compilers ever emit “rep movsb/w/d”?

混江龙づ霸主 submitted on 2019-12-10 18:18:52
Question: This question made me wonder whether current modern compilers ever emit the REP MOVSB/W/D instructions. Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs. But no matter how I tried, I could not make any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) emit this instruction. For this simple code, it could be reasonable to emit REP MOVSB:

```c
void fn(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++) {
        dst[i] = src[i];
    }
}
```

But compilers…
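For comparison, here is what emitting `rep movsb` by hand looks like; a minimal sketch using GCC/Clang extended inline asm (my addition, not from the question). `rep movsb` copies rcx bytes from [rsi] to [rdi]:

```c
#include <stddef.h>

/* Copy len bytes with rep movsb; dst, src, len map to rdi, rsi, rcx. */
static void copy_rep_movsb(char *dst, const char *src, size_t len)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len)
                     :
                     : "memory");
}
```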

Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

耗尽温柔 submitted on 2019-12-10 17:57:35
Question: I cannot understand why the first code takes ~1 cycle per iteration and the second takes 2 cycles per iteration. I measured with Agner's tool and with perf. According to IACA it should take 1 cycle, and so say my theoretical computations too.

This takes 1 cycle per iteration:

```nasm
; array is defined in section .data
%define n 1000000
    xor rcx, rcx
.begin:
    movnti [array], eax
    add rcx, 1
    cmp rcx, n
    jle .begin
```

And this takes 2 cycles per iteration, but why?

```nasm
; array is defined in section .data
%define n 1000000
```
…
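The second loop is truncated above; going by the title, it differs only by an inserted `nop`. A hypothetical C reconstruction of the pair (my assumption, using SSE2's `_mm_stream_si32` for `movnti` and an inline-asm `nop`):

```c
#include <emmintrin.h>  /* _mm_stream_si32 -> movnti */

static char array[64];

void loop_one_cycle(int v, long n)    /* reported at ~1 cycle/iteration */
{
    for (long i = 0; i <= n; i++)
        _mm_stream_si32((int *)array, v);
}

void loop_two_cycles(int v, long n)   /* reported at ~2 cycles/iteration */
{
    for (long i = 0; i <= n; i++) {
        _mm_stream_si32((int *)array, v);
        __asm__ volatile("nop");      /* the extra nop suggested by the title */
    }
}
```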

Why does _umul128 work slower than scalar code for a mul128x64x2 function?

情到浓时终转凉″ submitted on 2019-12-10 17:37:48
Question: I am trying for the second time to implement a fast mul128x64x2 function. The first time I asked the question without a comparison against the _umul128 MSVC version. Now I have made such a comparison, and the results I got show that the _umul128 function is slower than both the native scalar and the handmade SIMD AVX 1.0 code. Below is my test code:

```cpp
#include <iostream>
#include <chrono>
#include <intrin.h>
#include <emmintrin.h>
#include <immintrin.h>

#pragma intrinsic(_umul128)

constexpr uint32_t LOW[4] = { 4294967295u, 0u,
```
…
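For reference, `_umul128` computes the full 128-bit product of two 64-bit operands. A portable sketch of the same operation for GCC/Clang, which lack that MSVC intrinsic (my addition, relying on the `unsigned __int128` extension):

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit multiply: returns the low half, writes the high
 * half through *hi -- the same contract as MSVC's _umul128. */
static uint64_t umul128_portable(uint64_t a, uint64_t b, uint64_t *hi)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    return (uint64_t)p;
}
```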

Faster way to test if xmm/ymm register is zero?

Deadly submitted on 2019-12-10 16:48:16
Question: It's fortunate that PTEST does not affect the carry flag, but only sets the (rather awkward) ZF. (Edit: it also affects both CF and ZF.) I've come up with the following sequence to test a large number of values, but I'm unhappy with the poor running time.

```nasm
                           ; Latency / rThroughput
setup:
    xor   eax, eax         ;  na
    vpxor xmm0, xmm0, xmm0 ;  na    ; mask to use for the nand operation of ptest
work:
    vptest xmm4, xmm0      ;  3  1  ; is xmm4 alive?
    adc    eax, eax        ;  1  1  ; move first bit into eax
    vptest xmm5, xmm0      ;  3  1  ; is N alive?
    adc    eax, eax        ;  1  1
```
…
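For a plain is-this-register-zero test, PTEST of a register against itself already sets ZF exactly when the register is all-zero. A short intrinsics sketch of that, plus a PTEST-free SSE2 alternative (my addition, not from the question):

```c
#include <immintrin.h>

/* ptest v, v: ZF = ((v & v) == 0), so this returns 1 iff v is all-zero. */
static int xmm_is_zero(__m128i v)
{
    return _mm_testz_si128(v, v);
}

/* Without SSE4.1: byte-compare against zero, then extract the 16-bit mask. */
static int xmm_is_zero_sse2(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```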

Why are bitwise operators slower than multiplication/division/modulo?

我与影子孤独终老i submitted on 2019-12-10 12:55:29
Question: It's a well-known fact that multiplication, integer division, and modulo by powers of two can be rewritten more efficiently as bitwise operations:

```python
>>> x = randint(50000, 100000)
>>> x << 2 == x * 4
True
>>> x >> 2 == x // 4
True
>>> x & 3 == x % 4
True
```

In compiled languages such as C/C++ and Java, tests have shown that bitwise operations are generally faster than arithmetic operations (see here and here). However, when I test these in Python, I am getting contrary results:

```python
In [1]: from
```
…
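One caveat worth stating: in a compiled language an optimizing compiler performs this strength reduction by itself, so the source-level rewrite usually changes nothing. A minimal C illustration (my addition): both functions below typically compile to the same single shift instruction at -O2.

```c
unsigned times4_mul(unsigned x)   { return x * 4; }   /* compiler emits a shift */
unsigned times4_shift(unsigned x) { return x << 2; }  /* same code as above     */
```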

Fast search and replace of a nibble in an int [C; micro-optimisation]

余生颓废 submitted on 2019-12-09 13:02:59
Question: This is a variant of the question "Fast search of some nibbles in two ints at same offset (C, microoptimisation)" with a different task: find a predefined nibble in an int32 and replace it with another nibble. For example, with 0x5 as the nibble to search for and 0xE as the nibble to replace it with:

```
int:    0x3d542753 (input)
            ^   ^
output: 0x3dE427E3 (output int)
```

There can be other pairs of nibble-to-search and nibble-to-replace (known at compile time). I checked my program, and this part is one of the hottest places (gprof…
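A branch-free SWAR approach is a common fit here. A minimal sketch (my assumption of one possible technique, not the asker's code): XOR with the searched nibble broadcast to all positions turns matches into zero nibbles, a classic zero-nibble detector builds a per-nibble flag bit, and the flag is widened into a full 0xF mask used to splice in the replacement.

```c
#include <stdint.h>

#define FIND 0x5u   /* nibble to search for   (compile-time constant) */
#define REPL 0xEu   /* nibble to replace with (compile-time constant) */

static uint32_t replace_nibbles(uint32_t x)
{
    uint32_t t = x ^ (FIND * 0x11111111u);   /* matching nibbles become 0 */
    /* Bit 4k of m is set iff nibble k of t is zero (NOR of its 4 bits). */
    uint32_t m = ~t & ~(t >> 1) & ~(t >> 2) & ~(t >> 3) & 0x11111111u;
    uint32_t mask = m * 0xFu;                /* widen each flag to 0xF    */
    return (x & ~mask) | (mask & (REPL * 0x11111111u));
}
/* replace_nibbles(0x3d542753) == 0x3dE427E3, matching the example above. */
```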