sse

Fast 24-bit array -> 32-bit array conversion?

Submitted by 喜欢而已 on 2019-12-17 18:20:04
Question: Quick Summary: I have an array of 24-bit values. Any suggestions on how to quickly expand the individual 24-bit array elements into 32-bit elements? Details: I'm processing incoming video frames in real time using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32 bits before I can load them into the GPU. I
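A minimal scalar sketch of such an expansion, assuming packed source bytes and a hypothetical helper name (an SSSE3 version would instead shuffle 16 source bytes at a time with pshufb):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (not from the question): expand n packed 24-bit
   pixels into 32-bit pixels, filling the fourth byte with 0xFF. */
void expand24to32(const uint8_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint32_t)src[3 * i]
               | ((uint32_t)src[3 * i + 1] << 8)
               | ((uint32_t)src[3 * i + 2] << 16)
               | 0xFF000000u;              /* opaque alpha / padding */
    }
}
```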

Fast counting the number of set bits in __m128i register

Submitted by 社会主义新天地 on 2019-12-17 18:09:29
Question: I need to count the number of set bits in a __m128i register. Specifically, I need to write two functions that count the bits of the register in the following ways: the total number of set bits in the register, and the number of set bits in each byte of the register. Are there intrinsic functions that can perform, wholly or partially, the above operations? Answer 1: Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below

Questions about the performance of different implementations of strlen [closed]

Submitted by 孤街醉人 on 2019-12-17 16:53:52
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 3 years ago. I have implemented the strlen() function in different ways, including SSE2 assembly, SSE4.2 assembly and SSE2 intrinsics. I also ran some experiments on them, comparing against strlen() in <string.h> and strlen() in glibc. However, their performance in milliseconds is unexpected. My experiment

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

Submitted by 风流意气都作罢 on 2019-12-17 16:44:40
Question: I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed: __m128i oldRect; // contains old left, top, right, bottom packed to 128 bits __m128i newRect; // contains new left, top, right, bottom packed to 128 bits __m128i xor = _mm_xor_si128(oldRect, newRect); At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that? Currently I am doing this: if (xor.m128i_u64[0] | xor

What's the difference between logical SSE intrinsics?

Submitted by 南笙酒味 on 2019-12-17 16:06:31
Question: Is there any difference between the logical SSE intrinsics for different types? For example, taking the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one or another intrinsic (with appropriate type casting)? Won't there be any hidden costs like longer execution in some specific situation? These intrinsics map to three different x86

Newton-Raphson with SSE2 - can someone explain these 3 lines

Submitted by 回眸只為那壹抹淺笑 on 2019-12-17 15:54:17
Question: I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing and I stumbled upon these three lines of code: The SIMD version is already quite a bit faster, but we can do better. Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set. The only drawback is that its precision is limited. We need the precision, so we refine it using Newton-Raphson: __m128 nr = _mm_rsqrt_ps( x ); __m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr ); result = _mm_mul_ps(

Best cross-platform method to get aligned memory

Submitted by 浪尽此生 on 2019-12-17 15:37:12
Question: Here is the code I normally use to get aligned memory with Visual Studio and GCC: inline void* aligned_malloc(size_t size, size_t align) { void *result; #ifdef _MSC_VER result = _aligned_malloc(size, align); #else if(posix_memalign(&result, align, size)) result = 0; #endif return result; } inline void aligned_free(void *ptr) { #ifdef _MSC_VER _aligned_free(ptr); #else free(ptr); #endif } Is this code fine in general? I have also seen people use _mm_malloc , _mm_free . In most cases that I want

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

Submitted by 半世苍凉 on 2019-12-17 10:58:56
Question: This question already has answers here: Get sum of values stored in __m256d with SSE/AVX (2 answers) Closed 11 months ago. I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum, v_sum); Unfortunately, even though AVX features a _mm256_hadd_pd instruction, its result differs from the SSE version. I

SSE multiplication of 4 32-bit integers

Submitted by 拜拜、爱过 on 2019-12-17 10:54:09
Question: How do I multiply four 32-bit integers by another four integers? I couldn't find an instruction that can do it. Answer 1: If you need signed 32x32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want: static inline __m128i muly(const __m128i &a, const __m128i &b) { __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/ __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */ return _mm_unpacklo_epi32(_mm_shuffle_epi32

Efficient 4x4 matrix multiplication (C vs assembly)

Submitted by 假装没事ソ on 2019-12-17 10:24:14
Question: I'm looking for a faster and trickier way to multiply two 4x4 matrices in C. My current research is focused on x86-64 assembly with SIMD extensions. So far, I've created a function which is about 6x faster than a naive C implementation, which exceeded my expectations for the performance improvement. Unfortunately, this holds true only when no optimization flags are used for compilation (GCC 4.7). With -O2, C becomes faster and my effort becomes meaningless. I know that modern compilers