sse

What is the equivalent of v4sf and __attribute__ in Visual Studio C++?

Submitted by 泪湿孤枕 on 2020-01-12 10:47:02
Question: typedef float v4sf __attribute__ ((mode(V4SF))); This is GCC syntax. Does anyone know the equivalent syntax in Visual Studio C++? VS 2010 reports that __attribute__ has no storage class of this type and that mode is not defined. I searched on the Internet and found advice along the lines of: equivalent to __attribute__(( aligned( size ) )) in GCC; it is helpful for former Unix developers, or people writing code that works on multiple platforms, that in GCC you achieve the same result using __attribute__(( aligned( ... ) )). See here for more information: http:/
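A minimal sketch of how this is commonly handled when porting: MSVC has no __attribute__ at all, so the usual substitutes are the __m128 type from <xmmintrin.h> (already a 16-byte-aligned vector of four floats) and __declspec(align(16)) for plain alignment. The typedef name and the buffer below are illustrative, not from the original question.

```cpp
#include <xmmintrin.h>   // SSE: __m128, _mm_set1_ps, _mm_add_ps, _mm_storeu_ps
#include <cstring>       // std::memcpy

// GCC / Clang accept the original declaration (vector_size(16) is the modern
// spelling of mode(V4SF)); on MSVC the closest substitutes are the __m128 type
// itself and __declspec(align(16)) for aligned storage.
#ifdef _MSC_VER
typedef __m128 v4sf;                        // four packed floats, 16-byte aligned
__declspec(align(16)) static float buf[4];  // plain 16-byte-aligned storage
#else
typedef float v4sf __attribute__ ((vector_size (16)));
__attribute__ ((aligned (16))) static float buf[4];
#endif

int main()
{
#ifdef _MSC_VER
    v4sf v = _mm_set1_ps(1.0f);             // v4sf is __m128 here
    v = _mm_add_ps(v, v);
    _mm_storeu_ps(buf, v);
#else
    v4sf v = {1.0f, 1.0f, 1.0f, 1.0f};      // GCC vector types take initializers
    v = v + v;                              // and support arithmetic operators
    std::memcpy(buf, &v, sizeof v);
#endif
    return static_cast<int>(buf[0]);        // 2
}
```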

Why is prefetch speedup not greater in this example?

Submitted by 我怕爱的太早我们不能终老 on 2020-01-12 08:37:27
Question: In section 6.3.2 of this excellent paper, Ulrich Drepper writes about software prefetching. He says this is the "familiar pointer chasing framework", which I gather is the test he gives earlier about traversing randomized pointers. It makes sense in his graph that performance tails off when the working set exceeds the cache size, because then we go to main memory more and more often. But why does prefetching help by only 8% here? If we are telling the processor exactly what we want to load, and
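For context, a minimal sketch of the kind of pointer-chasing loop with a software prefetch that the question is about; the node layout, prefetch hint, and loop shape are illustrative assumptions, not Drepper's exact benchmark code.

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

struct Node {
    Node *next;          // randomized linked list: every hop is a likely cache miss
    long  pad[7];        // pad the node to one 64-byte cache line
};

long chase(Node *p, long iterations)
{
    long sum = 0;
    for (long i = 0; i < iterations; ++i) {
        // Request the next node's cache line while working on the current one.
        // The prefetch cannot be issued before the current node has been loaded
        // (its address lives inside that node), so it can only overlap the small
        // amount of work done per node, which limits how much latency it hides.
        _mm_prefetch(reinterpret_cast<const char *>(p->next), _MM_HINT_T0);
        sum += p->pad[0];
        p = p->next;
    }
    return sum;
}
```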

Constexpr and SSE intrinsics

Submitted by 半世苍凉 on 2020-01-12 07:20:31
Question: Most C++ compilers support SIMD (SSE/AVX) instructions with intrinsics like _mm_cmpeq_epi32. My problem is that this function is not marked constexpr, although "semantically" there is no reason for it not to be constexpr, since it is a pure function. Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr? Obviously I would like the function to use the proper asm at runtime; I know I can reimplement any SIMD function with slow
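One hedged sketch of how such a wrapper can be written with C++20: std::is_constant_evaluated() selects a scalar fallback during constant evaluation and the real intrinsic at runtime. The wrapper name my_cmpeq_epi32 and the four-lane struct are illustrative, not part of any standard API.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <type_traits>   // std::is_constant_evaluated (C++20)

// Illustrative 4-lane value type that is usable in constant expressions.
struct i32x4 { std::int32_t v[4]; };

constexpr i32x4 my_cmpeq_epi32(i32x4 a, i32x4 b)
{
    if (std::is_constant_evaluated()) {
        // Compile-time path: plain scalar loop with the intrinsic's semantics.
        i32x4 r{};
        for (int i = 0; i < 4; ++i)
            r.v[i] = (a.v[i] == b.v[i]) ? -1 : 0;
        return r;
    } else {
        // Runtime path: use the real instruction via _mm_cmpeq_epi32.
        __m128i ra = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a.v));
        __m128i rb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b.v));
        i32x4 r;
        _mm_storeu_si128(reinterpret_cast<__m128i *>(r.v), _mm_cmpeq_epi32(ra, rb));
        return r;
    }
}

// The wrapper is usable at compile time, unlike the raw intrinsic:
static_assert(my_cmpeq_epi32({{1, 2, 3, 4}}, {{1, 0, 3, 0}}).v[0] == -1);
```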

Questions regarding operations on NaN

Submitted by 爷,独闯天下 on 2020-01-11 04:54:26
Question: My SSE FPU generates the following NaNs: when I do any basic two-operand operation like ADDSD, SUBSD, MULSD or DIVSD and one of the two operands is a NaN, the result has the sign of the NaN operand and the lower 51 bits of the result's mantissa are loaded with the lower 51 bits of the NaN operand's mantissa. When both operands are NaN, the result takes the sign of the destination register and the lower 51 bits of the result mantissa are loaded with the lower 51 bits of the
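A small sketch that makes this observable from C++ by printing raw bit patterns around an ADDSD; the payload value 0xDEAD and the helper names are arbitrary test choices, and the expected output is what SSE NaN propagation rules suggest rather than anything quoted from the question.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <emmintrin.h>   // SSE2: __m128d, _mm_set_sd, _mm_add_sd, _mm_cvtsd_f64

static std::uint64_t bits(double d)        { std::uint64_t u; std::memcpy(&u, &d, 8); return u; }
static double from_bits(std::uint64_t u)   { double d;        std::memcpy(&d, &u, 8); return d; }

int main()
{
    // Quiet NaN with an arbitrary payload in the low mantissa bits.
    const double nan_payload = from_bits(0x7FF800000000DEADULL);
    const double x = 1.5;

    __m128d a = _mm_set_sd(nan_payload);               // destination operand
    __m128d b = _mm_set_sd(x);                         // source operand
    double  r = _mm_cvtsd_f64(_mm_add_sd(a, b));       // ADDSD with one NaN operand

    std::printf("operand: %016llx\n", (unsigned long long)bits(nan_payload));
    std::printf("result : %016llx\n", (unsigned long long)bits(r));
    // Expected on SSE hardware: the NaN operand, payload included, is propagated.
    return 0;
}
```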

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

Submitted by 安稳与你 on 2020-01-10 04:54:06
Question: I'm experimenting with SSE4.2 STTNI instructions and have got a strange result: PcmpEstrM (which works with explicit-length strings) runs twice as slow as PcmpIstrM (implicit-length strings). On my i7 3610QM the difference is 2366.2 ms vs. 1202.3 ms, about 97%. On an i5 3470 the difference is not as big, but still significant: 3206.2 ms vs. 2623.2 ms, about 22%. Both are "Ivy Bridge"; it is strange that they show such different "differences" (at least I can't see any technical differences in their specs -
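For reference, a hedged sketch of the two intrinsic forms being compared, using a byte search as a stand-in workload; the function names and mode flags are illustrative choices. The commonly cited reason for the gap is that the explicit-length form has to read its lengths from general-purpose registers, which costs extra uops.

```cpp
#include <nmmintrin.h>   // SSE4.2: _mm_cmpistrm, _mm_cmpestrm

// Implicit-length form (PCMPISTRM): operand lengths come from NUL bytes inside
// the 16-byte operands themselves.
int match_mask_implicit(const char *chunk16, char c)
{
    const __m128i needle = _mm_set1_epi8(c);
    const __m128i chunk  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(chunk16));
    const __m128i m = _mm_cmpistrm(needle, chunk,
                                   _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    return _mm_cvtsi128_si32(m);   // 16-bit match mask in the low bits
}

// Explicit-length form (PCMPESTRM): lengths are passed separately (in EAX/EDX
// at the instruction level), which decodes to additional uops and is the usual
// explanation for it running slower than the implicit-length form.
int match_mask_explicit(const char *chunk16, int len, char c)
{
    const __m128i needle = _mm_set1_epi8(c);
    const __m128i chunk  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(chunk16));
    const __m128i m = _mm_cmpestrm(needle, 1, chunk, len,
                                   _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    return _mm_cvtsi128_si32(m);
}
```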

Find the first instance of a character using simd

Submitted by £可爱£侵袭症+ on 2020-01-10 02:59:06
Question: I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get the matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions? Answer 1: You have the right idea
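A sketch of the approach the question describes (compare, movemask, count trailing zeros); the function name and the scalar tail handling are mine, not from the original answer.

```cpp
#include <immintrin.h>   // AVX2: _mm256_set1_epi8, _mm256_cmpeq_epi8, _mm256_movemask_epi8
#include <cstddef>

// Returns the index of the first occurrence of `c` in buf[0..n), or n if absent.
std::size_t find_byte_avx2(const char *buf, std::size_t n, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        const __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(buf + i));
        const __m256i eq    = _mm256_cmpeq_epi8(chunk, needle);        // 0xFF where equal
        const unsigned mask = static_cast<unsigned>(_mm256_movemask_epi8(eq));
        if (mask != 0)
            return i + __builtin_ctz(mask);  // lowest set bit = first match
                                             // (GCC/Clang; _BitScanForward on MSVC)
    }
    for (; i < n; ++i)                       // scalar tail for the last < 32 bytes
        if (buf[i] == c) return i;
    return n;
}
```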

Is __int128_t arithmetic emulated by GCC, even with SSE?

Submitted by 江枫思渺然 on 2020-01-09 11:07:28
Question: I've heard that the 128-bit integer data types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not by modern versions of GCC. The reason I'm asking this is because I'm
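A small sketch that can be used to check this directly: compiling it with gcc -O2 -S shows the 128-bit addition lowered to a pair of 64-bit instructions (roughly add plus adc) rather than to SSE, since SSE has no carry-propagating 128-bit integer add. The function names are illustrative.

```cpp
#include <cstdint>

// 128-bit addition: on x86-64, GCC emits roughly `add` + `adc` on the two
// 64-bit halves, not an SSE instruction.
__int128_t add128(__int128_t a, __int128_t b)
{
    return a + b;
}

// 64x64 -> 128-bit multiply: a single widening `mul` produces the 128-bit
// product in RDX:RAX, so this stays cheap despite the wide result type.
__uint128_t mul64x64(std::uint64_t a, std::uint64_t b)
{
    return static_cast<__uint128_t>(a) * b;
}
```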

FLT_EPSILON for a nth root finder with SSE/AVX

Submitted by 亡梦爱人 on 2020-01-07 06:50:26
Question: I'm trying to convert a function that finds the nth root of a double value in C, from the following link http://rosettacode.org/wiki/Nth_root#C, so that it finds the nth root of 8 floats at once using AVX. Part of that code uses DBL_EPSILON * 10. However, when I convert it to use float with AVX I have to use FLT_EPSILON * 1000 or the code hangs and does not converge. When I print out FLT_EPSILON I see it is of order 1E-7. But this link, http://www.cplusplus.com/reference/cfloat/, says it should be 1E-5.
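A hedged sketch of the kind of AVX Newton iteration involved, adapted from the scalar idea on the Rosetta Code page. The relative convergence test (FLT_EPSILON scaled by the magnitude of the current iterate), the tolerance factor 4, the initial guess, and the iteration cap are my own illustrative choices; they are shown only because a fixed absolute threshold like FLT_EPSILON * 10 can be unreachable in float, which is one way the loop ends up hanging.

```cpp
#include <immintrin.h>   // AVX
#include <cfloat>        // FLT_EPSILON

// Newton iteration for the nth root of 8 floats at once:
//   x <- ((n-1)*x + a / x^(n-1)) / n
__m256 nth_root_avx(__m256 a, int n, int max_iter = 60)
{
    const __m256 vn       = _mm256_set1_ps(static_cast<float>(n));
    const __m256 vnm1     = _mm256_set1_ps(static_cast<float>(n - 1));
    const __m256 eps      = _mm256_set1_ps(FLT_EPSILON * 4.0f);   // relative tolerance
    const __m256 signmask = _mm256_set1_ps(-0.0f);

    __m256 x = _mm256_set1_ps(1.0f);                 // crude initial guess
    for (int it = 0; it < max_iter; ++it) {
        __m256 p = x;                                // p = x^(n-1)
        for (int k = 1; k < n - 1; ++k)
            p = _mm256_mul_ps(p, x);
        __m256 xnew = _mm256_div_ps(
            _mm256_add_ps(_mm256_mul_ps(vnm1, x), _mm256_div_ps(a, p)), vn);

        // Converged when |xnew - x| <= eps * |xnew| in every lane.
        __m256 diff = _mm256_andnot_ps(signmask, _mm256_sub_ps(xnew, x));
        __m256 tol  = _mm256_mul_ps(eps, _mm256_andnot_ps(signmask, xnew));
        x = xnew;
        if (_mm256_movemask_ps(_mm256_cmp_ps(diff, tol, _CMP_GT_OQ)) == 0)
            break;
    }
    return x;
}
```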

Grayscale bilinear patch extraction - SSE optimization

Submitted by 一世执手 on 2020-01-06 19:58:12
Question: My program makes intensive use of small sub-images extracted from larger grayscale images using bilinear interpolation. I am using the following function for this purpose:

bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch) {
    const int hsize = patch.rows/2;
    // ...
    // Precondition checks: patch is a preallocated square matrix and both
    // patch and image have continuous buffers
    // ...
    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int
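For reference, a hedged sketch of the SSE2 formulation of the per-row bilinear blend such a function performs, exploiting the fact that every pixel of the patch shares the same fractional offsets. The function name, the 8.8 fixed-point weight scale, and the truncating (non-rounding) shift are my own illustrative choices, not the question's code.

```cpp
#include <emmintrin.h>   // SSE2
#include <cstdint>

// Bilinear blend of one output row of 8 pixels, given pointers into the two
// source rows at the (floorx, floory) anchor. wx, wy in [0, 256) are the
// fractional offsets scaled by 256; they are constant across the whole patch.
static inline void bilin_row8(const std::uint8_t *row0, const std::uint8_t *row1,
                              std::uint8_t *dst, int wx, int wy)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i vwx  = _mm_set1_epi16(static_cast<short>(wx));
    const __m128i vcx  = _mm_set1_epi16(static_cast<short>(256 - wx));
    const __m128i vwy  = _mm_set1_epi16(static_cast<short>(wy));
    const __m128i vcy  = _mm_set1_epi16(static_cast<short>(256 - wy));

    // Load 8 "left" and 8 "right" pixels from both source rows, widened to 16 bits.
    __m128i a0 = _mm_unpacklo_epi8(_mm_loadl_epi64(reinterpret_cast<const __m128i *>(row0)),     zero);
    __m128i b0 = _mm_unpacklo_epi8(_mm_loadl_epi64(reinterpret_cast<const __m128i *>(row0 + 1)), zero);
    __m128i a1 = _mm_unpacklo_epi8(_mm_loadl_epi64(reinterpret_cast<const __m128i *>(row1)),     zero);
    __m128i b1 = _mm_unpacklo_epi8(_mm_loadl_epi64(reinterpret_cast<const __m128i *>(row1 + 1)), zero);

    // Horizontal pass: h = (a*(256-wx) + b*wx) >> 8  (results stay in 0..255).
    __m128i h0 = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(a0, vcx), _mm_mullo_epi16(b0, vwx)), 8);
    __m128i h1 = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(a1, vcx), _mm_mullo_epi16(b1, vwx)), 8);

    // Vertical pass:   r = (h0*(256-wy) + h1*wy) >> 8, then pack back to bytes.
    __m128i r  = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(h0, vcy), _mm_mullo_epi16(h1, vwy)), 8);
    _mm_storel_epi64(reinterpret_cast<__m128i *>(dst), _mm_packus_epi16(r, r));
}
```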