sse2 | 易学教程

SIMD code runs slower than scalar code

elma and elmc are both unsigned long arrays. So are res1 and res2.

    unsigned long simdstore[2];
    __m128i *p, simda, simdb, simdc;
    p = (__m128i *) simdstore;
    for (i = 0; i < _polylen; i++) {
        u1 = (elma[i] >> l) & 15;
        u2 = (elmc[i] >> l) & 15;
        for (k = 0; k < 20; k++) {
            //res1[i + k] ^= _mulpre1[u1][k];
            //res2[i + k] ^= _mulpre2[u2][k];
            simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
            simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);
            simdc = _mm_xor_si128 (simda, simdb);
            _mm_store_si128 (p, simdc);
            res1[i + k] = simdstore[0];
            res2[i + k] = simdstore[1];
        }
    }

Within the for loop is
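The excerpt is cut off before any answer, so the following is only a sketch of one commonly suggested restructuring, not the thread's accepted solution. Instead of packing one word from res1 and one from res2 into a register with _mm_set_epi64x and copying the result back through a temporary, it vectorizes along k and loads two adjacent 64-bit words straight from memory. The helper name xor_rows and the use of uint64_t (standing in for a 64-bit unsigned long) are assumptions made for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[k] ^= src[k] for k in [0, n),
   processing two 64-bit words per iteration. */
static void xor_rows(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t k = 0;
    for (; k + 2 <= n; k += 2) {
        /* Unaligned loads/stores: no alignment assumption on dst or src. */
        __m128i d = _mm_loadu_si128((const __m128i *)(dst + k));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + k));
        _mm_storeu_si128((__m128i *)(dst + k), _mm_xor_si128(d, s));
    }
    for (; k < n; k++)  /* scalar tail when n is odd */
        dst[k] ^= src[k];
}
```

Under these assumptions the inner loop of the question would become two calls, roughly xor_rows(&res1[i], _mulpre1[u1], 20) and xor_rows(&res2[i], _mulpre2[u2], 20). Whether this beats the scalar loop depends on the compiler and CPU and would need to be measured.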

How to divide 16-bit integer by 255 with using SSE?

I work on image processing. I need to divide a vector of 16-bit integers in an SSE register by 255. I can't use a shift such as _mm_srli_epi16(), because 255 is not a power of 2. I know, of course, that it is possible to convert the integers to float, perform the division, and convert back to integer. But maybe somebody knows another solution...

There is an integer approximation of division by 255:

    inline int DivideBy255(int value)
    {
        return (value + 1 + (value >> 8)) >> 8;
    }

So with SSE2 it will look like:

    inline __m128i DivideI16By255(__m128i value)
    {
        return _mm_srli_epi16(_mm_add_epi16( _mm
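Because the SSE2 function above is truncated in the excerpt, here is a hedged sketch of how the scalar formula could be carried over lane by lane. It is not necessarily the original author's code. The name DivideU16By255_sketch is hypothetical, and the sketch assumes every unsigned 16-bit lane holds a value of at most 65279 (which covers products of two 8-bit channels, up to 65025), so that the 16-bit additions cannot wrap.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar approximation quoted in the text. */
static inline int DivideBy255(int value)
{
    return (value + 1 + (value >> 8)) >> 8;
}

/* Hypothetical SSE2 counterpart: the same formula applied to eight
   unsigned 16-bit lanes. Assumes each lane is <= 65279; larger inputs
   overflow the 16-bit sum and give wrong results. */
static inline __m128i DivideU16By255_sketch(__m128i value)
{
    __m128i plus_one = _mm_add_epi16(value, _mm_set1_epi16(1));
    __m128i sum = _mm_add_epi16(plus_one, _mm_srli_epi16(value, 8));
    return _mm_srli_epi16(sum, 8);
}
```

Within that input range the result should match truncating integer division by 255 in every lane.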

How to process a 24-bit 3 channel color image with SSE2/SSE3/SSE4?

I have just started using SSE2 to optimize image processing, but I have no idea how to handle 3-channel, 24-bit color images. My pixel data is arranged as BGR BGR BGR ..., with 8-bit unsigned char components. If I want to implement a Color2Gray C/C++ function with SSE2/SSE3/SSE4 instructions, how should I do it? Does my pixel data need to be aligned (4/8/16)? I have read this article: http://supercomputingblog.com/windows/image-processing-with-sse/ But it covers ARGB, 4-channel 32-bit color, where exactly 4 pixels are processed at a time. Thanks!

    //Assume the original pixel:
    unsigned char* pDataColor=(unsigned char*)malloc(src.width*src.height*3

SSE2 option in Visual C++ (x64)

I've added an x64 configuration to my C++ project to compile a 64-bit version of my app. Everything looks fine, but the compiler gives the following warning:

    cl : Command line warning D9002 : ignoring unknown option '/arch:SSE2'

Is SSE2 optimization really not available for 64-bit projects?

Answer: It seems that all 64-bit processors have SSE2. Since the option is always on by default in x64 builds, there is no need to switch it on manually. From Wikipedia: SSE instructions: The original AMD64 architecture adopted Intel's SSE and SSE2 as core instructions. SSE3 instructions were added in April 2005. SSE2 replaces
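To see from inside the code whether SSE2 is available at compile time, the predefined macros can be inspected. The sketch below is an illustration rather than part of the original answer. It relies on _M_X64 and _M_IX86_FP (MSVC) and on __x86_64__ and __SSE2__ (GCC/Clang); the function name is hypothetical.

```c
/* Hypothetical helper: returns 1 if this translation unit is compiled
   for a target where SSE2 instructions may be emitted, 0 otherwise. */
static int sse2_enabled_at_compile_time(void)
{
#if defined(_M_X64) || defined(__x86_64__)
    return 1;   /* x64 target: SSE2 is part of the baseline */
#elif defined(_M_IX86_FP) && _M_IX86_FP >= 2
    return 1;   /* 32-bit MSVC built with /arch:SSE2 or higher */
#elif defined(__SSE2__)
    return 1;   /* GCC/Clang with SSE2 code generation enabled */
#else
    return 0;
#endif
}
```

On an x64 build the first branch is taken without any /arch option, which is consistent with the compiler ignoring /arch:SSE2 there.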

Best way to load a 64-bit integer to a double precision SSE2 register?

Question: What is the best/fastest way to load a 64-bit integer value into an xmm SSE2 register in 32-bit mode? In 64-bit mode, cvtsi2sd can be used, but in 32-bit mode it supports only 32-bit integers. So far I haven't found much beyond:

1. use fild, fstp to the stack, then movsd to the xmm register
2. load the high 32-bit portion, multiply by 2^32, add the low 32-bit portion

The first solution is slow; the second solution might introduce precision loss ( edit: and it is slow anyway, since the low 32 bit have to be converted as
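The second approach in the list can be written out in plain C as a reference for what the arithmetic does. This is only a sketch with a hypothetical function name. It assumes IEEE-754 doubles evaluated without excess precision (for example SSE2 math rather than x87), in which case both partial conversions are exact and only the final addition rounds. It does not address the speed concern raised in the question.

```c
#include <stdint.h>

/* Hypothetical helper: convert a signed 64-bit integer to double by
   splitting it into a signed high word and an unsigned low word. */
static double i64_to_double_split(int64_t v)
{
    uint64_t bits = (uint64_t)v;
    /* Reinterpreting the high word as signed is implementation-defined
       for values above INT32_MAX; mainstream compilers wrap as expected. */
    int32_t hi = (int32_t)(uint32_t)(bits >> 32);
    uint32_t lo = (uint32_t)bits;
    /* hi * 2^32 and (double)lo are both exactly representable. */
    return (double)hi * 4294967296.0 + (double)lo;
}
```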

SSE instruction set not enabled

Question: I am having trouble with this error: "SSE instruction set not enabled". How can I figure this out? I have an Acer i7 running Ubuntu 11.10. Any help will be appreciated!

Also, running:

    sudo cat /proc/cpuinfo | grep flags

gives:

    flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somewhat confused by the MOVSD assembly instruction. I wrote some numerical code that computes a matrix multiplication, using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics when compiling. But when I check the assembler output, I see that: 1) the 128-bit XMM vector registers are used; 2) the SSE2 instruction MOVSD is invoked. I understand that MOVSD essentially operates on a single double-precision floating-point value. It uses only the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things: 1) I never
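The lane behaviour described above can be observed with the intrinsics that correspond to the two forms of MOVSD. This is a small sketch with hypothetical helper names. On x86-64, compilers use SSE2 scalar instructions for ordinary double arithmetic by default, which is why MOVSD shows up even when no intrinsics header is included.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Memory-to-register form: loads one double into the low lane and
   zeroes the high lane. Both lanes are written to out[2]. */
static void load_scalar_lanes(const double *src, double out[2])
{
    __m128d v = _mm_load_sd(src);
    _mm_storeu_pd(out, v);
}

/* Register-to-register form: the low lane comes from b, while the
   high lane of a is kept unchanged (it is not zeroed). */
static void move_scalar_lanes(double a_lo, double a_hi, double b_lo,
                              double out[2])
{
    __m128d a = _mm_set_pd(a_hi, a_lo);  /* arguments are high, low */
    __m128d b = _mm_set_sd(b_lo);
    _mm_storeu_pd(out, _mm_move_sd(a, b));
}
```

The upper half is cleared only when MOVSD loads from memory; the register-to-register form leaves the destination's upper half untouched.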
