sse | 易学教程

Fastest way to horizontally sum SSE unsigned byte vector

Question: I need to horizontally add a __m128i that holds 16 x epi8 values. The XOP instructions would make this trivial, but I don't have those available. My current method is:

    hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
    hd = _mm_hadd_epi16(hd, hd);
    hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?

Answer 1: You can do it with SSE2's _mm_sad_epu8 (psadbw), e.g.:

    inline uint32_t _mm_sum_epu8(const __m128i v) { __m128i vsum = _mm_sad_epu8(v, _mm
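The answer's code is cut off in this excerpt, so the following is only a minimal sketch of the psadbw idea it names, not necessarily the answer's exact code. psadbw against an all-zero vector yields two 64-bit lanes, each holding the sum of eight unsigned bytes; adding the two lanes gives the total. The function name sum_epu8_sketch is hypothetical, chosen for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sum the 16 unsigned bytes of v using only SSE2. */
static inline uint32_t sum_epu8_sketch(__m128i v)
{
    /* |v[i] - 0| summed per 8-byte half: two partial sums, one in the
       low 16 bits of each 64-bit lane. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* Copy the upper 64-bit lane down into the lower lane. */
    __m128i hi = _mm_unpackhi_epi64(sad, sad);

    /* Add the two partial sums; the maximum is 16 * 255 = 4080,
       so 32-bit addition cannot overflow. */
    __m128i total = _mm_add_epi32(sad, hi);

    /* Extract the low 32 bits as the result. */
    return (uint32_t)_mm_cvtsi128_si32(total);
}
```

Because the bytes are treated as unsigned, signed epi8 inputs would need a bias (for example, XOR each byte with 0x80 and subtract 16 * 128 from the result) before this approach applies.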

Does SSE/AVX provide a means of determining if a result was rounded up?

Question: One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up. Does SSE/AVX provide any such indication for scalar operations? I did not see a similar bit in the MXCSR register. Am I forced to use x87 instructions if I want this information?

Answer 1: SSE/AVX do not provide hardware support for detecting this, even for scalar instructions like addss. SSE was designed for SIMD, with 4 floats per XMM vector, and presumably Intel didn't want

How to compute sincos fast on a x64 CPU?

Source: https://stackoverflow.com/questions/48971804/how-to-compute-sincos-fast-on-a-x64-cpu

xmm instructions - segmentation fault with memory source operand

Source: https://stackoverflow.com/questions/14014622/xmm-instructions-segmentation-fault-with-memory-source-operand

inline assembly code to read/write XMM & YMM registers?

Source: https://stackoverflow.com/questions/57313195/inline-assembly-code-to-read-write-xmm-ymm-registers

AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code

Source: https://stackoverflow.com/questions/31466848/avx-256-bit-code-performing-slightly-worse-than-equivalent-128-bit-ssse3-code

How to optimise this 8-bit positional popcount using assembly?

Source: https://stackoverflow.com/questions/63248047/how-to-optimise-this-8-bit-positional-popcount-using-assembly