sse2 | 易学教程

Determine processor support for SSE2?

I need to determine whether the processor supports SSE2 before installing a piece of software. From what I understand, I came up with this: bool TestSSE2(char * szErrorMsg) { __try { __asm { xorpd xmm0, xmm0 // executing SSE2 instruction } } #pragma warning (suppress: 6320) __except (EXCEPTION_EXECUTE_HANDLER) { if (_exception_code() == STATUS_ILLEGAL_INSTRUCTION) { _tcscpy_s(szErrorMsg,MSGSIZE, _T("Streaming SIMD Extensions 2(SSE2) is not supported by the CPU.\r\n Unable to launch APP")); return false; } _tcscpy_s(szErrorMsg,MSGSIZE, _T("Streaming SIMD Extensions 2(SSE2) is not supported by the CPU.\r\n
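
One alternative to trapping an illegal-instruction exception is to ask the CPU directly through CPUID. The sketch below is not the asker's code: the function name `cpu_has_sse2` is hypothetical, and it assumes an x86 or x86-64 target built with MSVC, GCC, or Clang. It reads CPUID leaf 1 and tests bit 26 of EDX, which is the SSE2 feature flag.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (MSVC)
#else
#include <cpuid.h>    // __get_cpuid (GCC / Clang)
#endif

// Hypothetical helper: returns true when CPUID reports SSE2 support.
bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};      // EAX, EBX, ECX, EDX
    __cpuid(regs, 0);                // leaf 0: EAX = highest supported leaf
    if (regs[0] < 1) return false;   // leaf 1 not available
    __cpuid(regs, 1);                // leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;  // leaf 1 unsupported
    return (edx & (1u << 26)) != 0;      // EDX bit 26 = SSE2
#endif
}
```

On any x86-64 processor this should return true, since SSE2 is part of the x86-64 baseline; the check is mainly useful for 32-bit installers.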

Fast counting the number of equal bytes between two arrays [duplicate]

This question already has an answer here: Can counting byte matches between two strings be optimized using SIMD? (3 answers) I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) to compare two 16-byte values using SSE instructions; it returns how many bytes are equal after performing the comparison. Now I would like to use this function to compare two byte arrays of arbitrary length. The length may not be a multiple of 16 bytes, so I need to deal with that case. How could I complete the implementation of the function below? How could I improve the
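
A minimal sketch of one way to handle the arbitrary-length case follows. The body of `compare_16bytes` is an assumption (the excerpt names the function but does not show it), and `count_equal_bytes` is a hypothetical name. Full 16-byte blocks go through the SSE2 path with unaligned loads; the leftover bytes are compared one at a time.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <bitset>
#include <cstddef>
#include <cstdint>

// Assumed implementation: number of byte positions where lhs and rhs match.
int compare_16bytes(__m128i lhs, __m128i rhs) {
    __m128i eq = _mm_cmpeq_epi8(lhs, rhs);   // 0xFF where bytes are equal
    int mask = _mm_movemask_epi8(eq);        // one bit per byte, 16 bits total
    return static_cast<int>(std::bitset<16>(static_cast<unsigned>(mask)).count());
}

// Hypothetical wrapper for arrays whose length need not be a multiple of 16.
std::size_t count_equal_bytes(const std::uint8_t* a, const std::uint8_t* b,
                              std::size_t n) {
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {           // full 16-byte blocks
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        count += static_cast<std::size_t>(compare_16bytes(va, vb));
    }
    for (; i < n; ++i) {                     // scalar tail, at most 15 bytes
        count += (a[i] == b[i]) ? 1u : 0u;
    }
    return count;
}
```

The scalar tail avoids reading past the end of either array, at the cost of up to 15 byte-wise comparisons per call.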

Valgrind and Java

I want to use Valgrind 3.7.0 to find memory leaks in my Java native code. I'm using jdk1.6.0._29. To do that, I have to set the --trace-children=yes flag. With that flag set, I can no longer run Valgrind on any Java application; even a command like valgrind --trace-children=yes --smc-check=all java -version fails with the error message: Error occurred during initialization of VM Unknown x64 processor: SSE2 not supported. I've seen this link: https://bugs.kde.org/show_bug.cgi?id=249943 , but it was not useful. Running the program without Valgrind or without the --trace-children flag works fine. Does

SSE instructions to add all elements of an array [duplicate]

This question already has an answer here: Sum reduction of unsigned bytes without overflow, using SSE2 on Intel (2 answers) I am new to SSE2 instructions. I have found the intrinsic _mm_add_epi8, which adds the elements of two arrays pairwise. But I want an SSE instruction that adds up all the elements of an array. I was trying to develop this concept using this code: #include <iostream> #include <conio.h> #include <emmintrin.h> void sse(unsigned char* a,unsigned char* b); void main() { /*unsigned char *arr; arr=(unsigned char *)malloc(50);*/ unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n',
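
A common way to sum unsigned bytes without overflow in SSE2 is `_mm_sad_epu8` against a zero vector, which adds each group of 8 bytes into a 64-bit lane. The sketch below assumes that approach; it is not the asker's code, and `sum_bytes` is a hypothetical name.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sum of all bytes in data[0..n).
std::uint64_t sum_bytes(const std::uint8_t* data, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();       // two 64-bit partial sums
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // |v - 0| summed over each 8-byte half gives two 64-bit byte sums.
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // Fold the upper 64-bit lane onto the lower one.
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    std::uint64_t total = 0;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&total), acc);  // low 64 bits
    for (; i < n; ++i) {                     // scalar tail, at most 15 bytes
        total += data[i];
    }
    return total;
}
```

Because the partial sums live in 64-bit lanes, the byte-wise wraparound that `_mm_add_epi8` would cause does not occur here.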

SSE multiplication of 2 64-bit integers

Question: How can I multiply two 64-bit integers by another two 64-bit integers? I didn't find any instruction that can do it. Answer 1: I know this is an old question, but I was actually looking for exactly this. As there's still no instruction for it, I implemented the 64-bit multiply myself with pmuldq, as Paul R mentioned. This is what I came up with: // requires g++ -msse4.1 ... #include <emmintrin.h> #include <smmintrin.h> __m128i Multiply64Bit(__m128i a, __m128i b) { auto ax0_ax1_ay0_ay1 = a; auto bx0
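
For comparison, here is a sketch that needs only SSE2. It swaps in a different technique from the answer above: instead of pmuldq (SSE4.1), it uses `_mm_mul_epu32` (pmuludq) and assembles the low 64 bits of each product from 32-bit partial products. The function name `multiply_64bit_sse2` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: per-lane low 64 bits of a * b for two 64-bit lanes.
// With a = aH*2^32 + aL and b = bH*2^32 + bL (mod 2^64):
//   a * b = aL*bL + ((aH*bL + aL*bH) << 32)
__m128i multiply_64bit_sse2(__m128i a, __m128i b) {
    __m128i a_hi = _mm_srli_epi64(a, 32);         // aH in the low 32 bits of each lane
    __m128i b_hi = _mm_srli_epi64(b, 32);         // bH in the low 32 bits of each lane
    __m128i lo_lo = _mm_mul_epu32(a, b);          // aL * bL (full 64-bit product)
    __m128i cross = _mm_add_epi64(_mm_mul_epu32(a_hi, b),    // aH * bL
                                  _mm_mul_epu32(a, b_hi));   // aL * bH
    return _mm_add_epi64(lo_lo, _mm_slli_epi64(cross, 32));
}
```

The aH*bH term is dropped because it only affects bits 64 and above, so the result matches ordinary wrapping 64-bit multiplication for both signed and unsigned inputs.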

Fast counting the number of set bits in __m128i register

I need to count the number of set bits in a __m128i register. In particular, I need to write two functions that count the bits of the register in the following ways: the total number of set bits in the register, and the number of set bits in each byte of the register. Are there intrinsic functions that can perform, wholly or partially, the above operations? Here is some code I used in an old project ( there is a research paper about it ). The function popcnt8 below computes the number of bits set in each byte. SSE2-only version (based on Algorithm 3 in Hacker's Delight
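
The answer's own code is cut off above, so the following is only a sketch of the classic SSE2-only bit-slicing approach, not a reconstruction of it. The body of `popcnt8` is an assumption, and `popcnt128` is a hypothetical name for the total count.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Assumed implementation: population count of each of the 16 bytes.
__m128i popcnt8(__m128i x) {
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);
    // 2-bit sums: x - ((x >> 1) & 0x55). The masks discard bits that the
    // 16-bit shifts drag in from the neighbouring byte.
    x = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi16(x, 1), m1));
    // 4-bit sums.
    x = _mm_add_epi8(_mm_and_si128(x, m2),
                     _mm_and_si128(_mm_srli_epi16(x, 2), m2));
    // 8-bit sums, kept in the low nibble of each byte.
    x = _mm_and_si128(_mm_add_epi8(x, _mm_srli_epi16(x, 4)), m4);
    return x;
}

// Hypothetical helper: total number of set bits in the register (0..128).
int popcnt128(__m128i x) {
    // _mm_sad_epu8 against zero sums the 8 byte counts in each 64-bit half.
    __m128i sums = _mm_sad_epu8(popcnt8(x), _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) +
           _mm_cvtsi128_si32(_mm_srli_si128(sums, 8));
}
```

On CPUs with the POPCNT instruction, applying a scalar popcount to the two 64-bit halves is another option for the total; the version above stays within SSE2.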

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

Question: I am somewhat confused by the MOVSD assembly instruction. I wrote some numerical code computing a matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that: 1) 128-bit vector registers XMM are used; 2) SSE2 instruction MOVSD is invoked. I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64
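
To make the scalar-in-a-vector-register behaviour concrete, here is a small sketch using the scalar-double intrinsics, which typically compile to MOVSD and ADDSD. Both function names are hypothetical. It shows that only the low 64-bit lane takes part in the arithmetic and that a scalar load from memory zeroes the upper lane.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: adds two doubles using only the low lane of XMM registers.
double add_scalar_sd(double x, double y) {
    __m128d a = _mm_load_sd(&x);     // low lane = x, upper lane = 0
    __m128d b = _mm_load_sd(&y);     // low lane = y, upper lane = 0
    __m128d s = _mm_add_sd(a, b);    // low lane = x + y, upper lane copied from a
    double out;
    _mm_store_sd(&out, s);           // store the low lane only
    return out;
}

// Hypothetical helper: returns the upper lane after a scalar load.
double upper_lane_after_load_sd(double x) {
    __m128d a = _mm_load_sd(&x);
    double lanes[2];
    _mm_storeu_pd(lanes, a);         // lanes[0] = low lane, lanes[1] = upper lane
    return lanes[1];
}
```

This is the same pattern a compiler emits for plain `double` arithmetic when it targets SSE2 for floating point, which is the default on x86-64.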

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y are pixel values (i.e., bytes in the range 0~255) and a and b are floats. I need to apply this formula to every pixel in the image, and a and b differ from pixel to pixel. Direct calculation in C++ is slow, so I am interested in how to do this with SSE2 instructions in C++. After searching, I found that float multiplication and addition with SSE2 are done with _mm_mul_ps and _mm_add_ps. But first I need to convert x from byte to float (4 bytes). The question is, after I load the data from the byte-data source ( _mm
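
The byte-to-float conversion asked about here can be done in SSE2 by widening with zero-extending unpacks and then calling `_mm_cvtepi32_ps`. The sketch below processes four pixels per iteration; the function name `scale_pixels` and its array layout (one `a` and one `b` per pixel) are assumptions, and it assumes every result fits in a 32-bit integer before saturation.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides the SSE float ops)
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: y[i] = saturate_to_byte(round(a[i] * x[i] + b[i])).
void scale_pixels(const std::uint8_t* x, const float* a, const float* b,
                  std::uint8_t* y, std::size_t n) {
    const __m128i zero = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int packed;
        std::memcpy(&packed, x + i, 4);                // 4 pixels as one 32-bit word
        __m128i v8 = _mm_cvtsi32_si128(packed);
        __m128i v16 = _mm_unpacklo_epi8(v8, zero);     // bytes  -> 16-bit
        __m128i v32 = _mm_unpacklo_epi16(v16, zero);   // 16-bit -> 32-bit
        __m128 xf = _mm_cvtepi32_ps(v32);              // 32-bit int -> float
        __m128 yf = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(a + i), xf),
                               _mm_loadu_ps(b + i));
        __m128i r32 = _mm_cvtps_epi32(yf);             // round per current rounding mode
        __m128i r16 = _mm_packs_epi32(r32, r32);       // signed saturation to int16
        __m128i r8 = _mm_packus_epi16(r16, r16);       // unsigned saturation to 0..255
        int out = _mm_cvtsi128_si32(r8);
        std::memcpy(y + i, &out, 4);
    }
    for (; i < n; ++i) {                               // scalar tail, at most 3 pixels
        float v = a[i] * static_cast<float>(x[i]) + b[i];
        if (v < 0.0f) v = 0.0f;
        if (v > 255.0f) v = 255.0f;
        y[i] = static_cast<std::uint8_t>(std::lrintf(v));
    }
}
```

The two pack steps clamp out-of-range results to 0 or 255 instead of letting them wrap, which is usually what is wanted for pixel data.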

What's the difference between logical SSE intrinsics?

Is there any difference between logical SSE intrinsics for different types? For example, for the OR operation there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one or another intrinsic (with appropriate type casting)? Won't there be any hidden costs like longer execution in some specific situation? These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any ideas why Intel is wasting precious opcode space for several
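
The claim that all three intrinsics compute the same bits can be checked with a short sketch. The function name `or_variants_agree` is hypothetical; the cast intrinsics only reinterpret the register and do not convert values. This checks the results only and says nothing about performance differences between the three instructions.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true when por, orps and orpd give identical bit patterns.
bool or_variants_agree(__m128i a, __m128i b) {
    __m128i r_int = _mm_or_si128(a, b);                               // por
    __m128i r_ps = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));         // orps
    __m128i r_pd = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));         // orpd
    // Byte-wise equality; a mask of 0xFFFF means all 16 bytes match.
    int same_ps = _mm_movemask_epi8(_mm_cmpeq_epi8(r_int, r_ps));
    int same_pd = _mm_movemask_epi8(_mm_cmpeq_epi8(r_int, r_pd));
    return same_ps == 0xFFFF && same_pd == 0xFFFF;
}
```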

How to divide 16-bit integer by 255 with using SSE?

Question: I work on image processing. I need to divide a 16-bit integer SSE vector by 255. I can't use a shift operator like _mm_srli_epi16(), because 255 is not a power of 2. I know, of course, that it is possible to convert the integers to float, perform the division, and then convert back to integer. But maybe somebody knows another solution... Answer 1: There is an integer approximation of division by 255: inline int DivideBy255(int value) { return (value + 1 + (value >> 8)) >> 8; } So, using SSE2
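
The answer's SSE2 code is cut off above, so what follows is only a sketch of how the scalar formula might be transcribed to 16-bit lanes, not the answerer's version. The function name `divide_by_255_epu16` is hypothetical. It assumes each lane holds a value in 0..65025 (the largest product of two bytes), so that the intermediate sum does not overflow 16 bits.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: per-lane (v + 1 + (v >> 8)) >> 8 on eight unsigned
// 16-bit values. Assumes every lane is in the range 0..65025.
__m128i divide_by_255_epu16(__m128i v) {
    const __m128i one = _mm_set1_epi16(1);
    __m128i sum = _mm_add_epi16(_mm_add_epi16(v, one),   // v + 1
                                _mm_srli_epi16(v, 8));   // + (v >> 8)
    return _mm_srli_epi16(sum, 8);                       // >> 8 (logical shift)
}
```

Within the assumed range the result equals the integer quotient v / 255, which covers the usual alpha-blending case where v is a product of two 8-bit values.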