sse2

Emulating shifts on 32 bytes with AVX

Question: I am migrating vectorized code written with SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discovered that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate on the two 128-bit halves of the AVX register separately, so zeros are introduced in between. (By contrast, _mm_slli_si128 and _mm_srli_si128 handle a whole SSE register.) Can you recommend a short substitute? UPDATE: _mm256_slli_si256 is efficiently achieved with _mm256_alignr_epi8
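A minimal sketch of the kind of substitute the update hints at, assuming AVX2 is available and the byte count is a compile-time constant strictly between 0 and 16. The macro names SHL256_BYTES and SHR256_BYTES and the wrapper functions are hypothetical, introduced only for illustration. The `target("avx2")` attribute is a GCC/Clang extension; compiling with -mavx2 would serve the same purpose.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helpers, not from the original post.
   "Left" means toward higher byte indices, as with _mm_slli_si128.
   Valid only for constant N with 0 < N < 16; `a` is evaluated twice. */

/* permute 0x08 builds t = [high lane: a.low, low lane: zero];
   alignr then pulls the bytes that must cross the lane boundary from t. */
#define SHL256_BYTES(a, N) \
    _mm256_alignr_epi8((a), _mm256_permute2x128_si256((a), (a), 0x08), 16 - (N))

/* permute 0x81 builds t = [high lane: zero, low lane: a.high]. */
#define SHR256_BYTES(a, N) \
    _mm256_alignr_epi8(_mm256_permute2x128_si256((a), (a), 0x81), (a), (N))

/* Shift 32 bytes left by 4 bytes across the whole register. */
__attribute__((target("avx2")))
void shl256_by4(const uint8_t in[32], uint8_t out[32]) {
    __m256i a = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, SHL256_BYTES(a, 4));
}

/* Shift 32 bytes right by 4 bytes across the whole register. */
__attribute__((target("avx2")))
void shr256_by4(const uint8_t in[32], uint8_t out[32]) {
    __m256i a = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, SHR256_BYTES(a, 4));
}
```

Shifts of 16 bytes or more would need a different combination (for example, the permute alone handles exactly 16), which this sketch does not cover.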

Performance optimisations of x86-64 assembly - Alignment and branch prediction

Question: I’m currently coding highly optimised versions of some C99 standard library string functions, such as strlen() and memset(), using x86-64 assembly with SSE2 instructions. So far I’ve managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise further. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades overall performance. And there’s absolutely

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

Question: I'm looking to understand SSE2's capabilities a little more, and would like to know whether one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication. Answer 1: SSE2 has no carry flag, but you can easily calculate the carry as carry = sum < a or carry = sum < b. Worse, SSE2 doesn't have 64-bit comparisons either, so you must use a workaround. Here is untested, unoptimized C code based on the idea above. inline bool lessthan
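Since the answer's code is cut off here, the following is a separate sketch of 128-bit addition using only SSE2, not the answer's own code. Instead of emulating the 64-bit comparison sum < a, it derives the carry from the top bit of (a & b) | ((a | b) & ~sum), which needs no comparison at all. The function name add128_sse2 and the layout (element 0 holds the low 64 bits, element 1 the high 64 bits) are assumptions made for illustration.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: out = a + b modulo 2^128, SSE2 only.
   Each operand is two 64-bit words, index 0 = low half, index 1 = high half. */
void add128_sse2(const uint64_t a[2], const uint64_t b[2], uint64_t out[2]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Independent 64-bit additions in both halves, no carry yet. */
    __m128i sum = _mm_add_epi64(va, vb);

    /* Carry out of each 64-bit half: top bit of (a & b) | ((a | b) & ~sum). */
    __m128i both   = _mm_and_si128(va, vb);
    __m128i either = _mm_or_si128(va, vb);
    __m128i carry  = _mm_or_si128(both, _mm_andnot_si128(sum, either));
    carry = _mm_srli_epi64(carry, 63);   /* 0 or 1 in each half */

    /* Move the low half's carry into the high half (the high half's own
       carry is shifted out and discarded), then add it in. */
    carry = _mm_slli_si128(carry, 8);
    sum = _mm_add_epi64(sum, carry);

    _mm_storeu_si128((__m128i *)out, sum);
}
```

XOR needs no such care, since _mm_xor_si128 already acts on all 128 bits at once. Subtraction could follow the same pattern with a borrow, and multiplication is considerably more involved; neither is shown here.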

Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

Question: I was reading today about researchers discovering that NVidia's Phys-X libraries use x87 FP rather than SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article's author goes on to quote: Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for
