sse

Array Error - Access violation reading location 0xffffffff

Submitted by 风格不统一 on 2019-12-05 23:15:47
I have previously used SIMD operations to improve the efficiency of my code, but I am now facing a new error which I cannot resolve. For this task, speed is paramount. The size of the array will not be known until the data is imported, and may be very small (100 values) or enormous (10 million values). For the latter case the code works fine, but I encounter an error when I use fewer than 130036 array values. Does anyone know what is causing this issue and how to resolve it? I have attached the (tested) code involved, which will be used later in a more complicated function. …
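A crash that appears only below a certain element count is typically a vector loop reading past the end of the buffer. A minimal sketch of the usual cure (names and the scaling operation are illustrative, not the poster's code): process full 4-float vectors only while one remains, then finish the tail with scalar code.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Scale n floats by k. The vector loop only runs while a full __m128
   (4 floats) remains; the last 0-3 elements are handled scalar, so no
   load ever touches memory past src + n. */
void scale_floats(float *dst, const float *src, size_t n, float k)
{
    const __m128 kv = _mm_set1_ps(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);   /* unaligned-safe load */
        _mm_storeu_ps(dst + i, _mm_mul_ps(v, kv));
    }
    for (; i < n; i++)                      /* scalar remainder */
        dst[i] = src[i] * k;
}
```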

Help me improve some more SSE2 code

Submitted by 烈酒焚心 on 2019-12-05 22:01:08
I am looking for some help to improve this bilinear scaling SSE2 code on Core 2 CPUs. On my Atom N270 and on an i7 this code is about 2x faster than the MMX code, but on Core 2 CPUs it is only equal to the MMX code. Code follows:

    void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to) {
        uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr;
        ULLint start = rdtsc();
        ULLint stop;
        if (from && to) {
            uint32 width, height;
            width = from->Bounds().IntegerWidth() + 1;
            height = from->Bounds().IntegerHeight() + 1;
            uint32 toWidth, toHeight;
            toWidth = to->Bounds().IntegerWidth() + 1;
            toHeight = to…
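For reference, the vertical half of a bilinear scale reduces to blending two source rows; in the 50/50 special case SSE2 can do 16 pixels per instruction with `pavgb`. A minimal sketch of that inner blend (general weights need the 16-bit multiply path; names are illustrative):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Average two rows of 8-bit pixel data, 16 bytes per iteration: the
   50/50 special case of a vertical bilinear blend. _mm_avg_epu8 computes
   (a + b + 1) >> 1 per byte, i.e. it rounds up, and the scalar tail
   matches that rounding exactly. */
void blend_rows_half(uint8_t *dst, const uint8_t *row0,
                     const uint8_t *row1, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(row0 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(row1 + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(a, b));
    }
    for (; i < n; i++)
        dst[i] = (uint8_t)((row0[i] + row1[i] + 1) >> 1);
}
```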

Websocket data unmasking / multi byte xor

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-05 21:15:31
The WebSocket spec defines unmasking data as

    j = i MOD 4
    transformed-octet-i = original-octet-i XOR masking-key-octet-j

where the mask is 4 bytes long and unmasking has to be applied per byte. Is there a way to do this more efficiently than just looping over the bytes? The server running the code can be assumed to be a Haswell CPU, and the OS is Linux with kernel > 3.2, so SSE etc. are all present. Coding is done in C, but I can do asm as well if necessary. I tried to look up the solution myself, but was unable to figure out whether there was an appropriate instruction in any of the dozens of SSE1-5/AVX extensions …
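Because 16 is a multiple of the 4-byte mask period, the mask can be replicated across a whole XMM register and XORed 16 bytes at a time without ever falling out of phase. A minimal SSE2 sketch (function name is illustrative):

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unmask a WebSocket payload in place: buf[i] ^= mask[i % 4].
   The 4-byte mask is broadcast to 16 bytes; since 16 % 4 == 0 every
   vector chunk starts at mask offset 0, so the broadcast stays valid. */
void ws_unmask(uint8_t *buf, size_t len, const uint8_t mask[4])
{
    uint32_t m32;
    memcpy(&m32, mask, 4);
    const __m128i m = _mm_set1_epi32((int32_t)m32);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_xor_si128(v, m));
    }
    for (; i < len; i++)        /* scalar tail for the last < 16 bytes */
        buf[i] ^= mask[i & 3];
}
```

XOR is its own inverse, so applying the function twice restores the original buffer.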

x86 CPU Dispatching for SSE/AVX in C++

Submitted by 天大地大妈咪最大 on 2019-12-05 20:53:58
I have an algorithm which benefits from hand optimisation with SSE(2) intrinsics. Moreover, the algorithm will also be able to benefit from the 256-bit AVX registers in the future. My question is, what is the best way to:

1. Register the available variants of my class at compile time; so if my classes are, say, Foo, FooSSE2 and FooAVX, I require a means of determining at runtime which classes are compiled in.
2. Determine the capabilities of the current CPU. At the lowest level this will result in a cpuid call.
3. Decide at runtime what to use based on what is compiled and what is supported. …
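On GCC/Clang, steps 2 and 3 can be collapsed into `__builtin_cpu_supports`, which wraps the cpuid check. A minimal dispatch sketch (the Foo_* variants are hypothetical stand-ins for the compiled-in classes; a real build would guard each with the matching `-m` flags or target attributes):

```c
#include <stdio.h>

typedef void (*foo_fn)(void);

/* Hypothetical variants; in practice each would contain the SSE2/AVX
   kernel and be compiled with the corresponding target options. */
static void Foo_scalar(void) { puts("scalar path"); }
static void Foo_sse2(void)   { puts("sse2 path");   }
static void Foo_avx(void)    { puts("avx path");    }

/* Pick the best variant the running CPU supports, best first. */
foo_fn select_foo(void)
{
    if (__builtin_cpu_supports("avx"))  return Foo_avx;
    if (__builtin_cpu_supports("sse2")) return Foo_sse2;
    return Foo_scalar;
}
```

Typically `select_foo()` is called once and the resulting function pointer (or factory-constructed object) is cached for the rest of the run.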

“Extend” data type size in SSE register

Submitted by 此生再无相见时 on 2019-12-05 18:36:05
I'm using VS2005 (at work) and need an SSE intrinsic that does the following: I have a pre-existing __m128i n filled with 16-bit integers a_1, a_2, ..., a_8. Since some calculations that I now want to do require 32 instead of 16 bits, I want to extract the two sets of four 16-bit integers from n and put them into two separate __m128i s which contain a_1,...,a_4 and a_5,...,a_8 respectively. I could do this manually using the various _mm_set intrinsics, but those would result in eight movs in assembly, and I'd hoped that there would be a faster way to do this. …
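Without SSE4.1's `_mm_cvtepi16_epi32` (which VS2005-era SSE2 targets lack), the standard trick is to unpack each 16-bit value into the high half of a 32-bit lane and arithmetic-shift right by 16, which replicates the sign bit. A sketch:

```c
#include <emmintrin.h>

/* Sign-extend the eight 16-bit lanes of n into two vectors of four
   32-bit lanes. _mm_unpacklo_epi16(n, n) interleaves each low element
   with itself, so every 32-bit lane holds a_k in both halves; shifting
   arithmetically right by 16 leaves a_k sign-extended to 32 bits. */
void widen_epi16(__m128i n, __m128i *lo, __m128i *hi)
{
    *lo = _mm_srai_epi32(_mm_unpacklo_epi16(n, n), 16); /* a_1..a_4 */
    *hi = _mm_srai_epi32(_mm_unpackhi_epi16(n, n), 16); /* a_5..a_8 */
}
```

For unsigned values, unpacking against a zero register (`_mm_unpacklo_epi16(n, _mm_setzero_si128())`) zero-extends in one instruction per half.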

Horizontal trailing maximum on AVX or SSE

Submitted by 落花浮王杯 on 2019-12-05 18:16:27
I have an __m256i register consisting of 16-bit values, and I want every zero element to take the value of the nearest preceding nonzero element. To give an example:

    input:  1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2
    output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2

Is there an efficient way of doing this on AVX or SSE, maybe with log(16) = 4 iterations? Addition: any solution on 128-bit vectors with 8 uint16_t s in them is appreciated as well. You can do this in log_2(SIMD_width) steps indeed. The idea is to shift the input vector x_vec by two bytes. Then we blend x_vec with the shifted vector such that x_vec is …
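The shift-and-blend idea sketched above works out to log2(8) = 3 steps for the 128-bit / 8-lane case: at each step, every lane that is still zero pulls in the value 1, 2, then 4 lanes earlier, doubling the distance covered. A minimal SSE2 sketch (lane 0 is the first array element):

```c
#include <emmintrin.h>

/* Replace each zero 16-bit lane with the nearest preceding nonzero lane.
   _mm_slli_si128 shifts whole bytes toward higher lane indices, so a
   2-byte shift moves each lane's value one lane forward. The blend is
   OR-with-mask: lanes that are already nonzero are left untouched. */
__m128i fill_trailing_epi16(__m128i x)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i m;
    m = _mm_cmpeq_epi16(x, zero);                               /* dist 1 */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 2)));
    m = _mm_cmpeq_epi16(x, zero);                               /* dist 2 */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 4)));
    m = _mm_cmpeq_epi16(x, zero);                               /* dist 4 */
    x = _mm_or_si128(x, _mm_and_si128(m, _mm_slli_si128(x, 8)));
    return x;
}
```

The 256-bit version needs one more step plus care at the 128-bit lane boundary, since AVX2 byte shifts do not cross it.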

Floating-point number vs fixed-point number: speed on Intel I5 CPU

Submitted by 岁酱吖の on 2019-12-05 17:52:55
I have a C/C++ program which involves intensive 32-bit floating-point matrix math computations such as addition, subtraction, multiplication, division, etc. Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much speed gain can I get? Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_dgemm to perform fixed-point calculations? I know that SSE (Streaming SIMD Extensions) operates on 4x32 = 8x16 = 128 bits of data at a time, i.e., four 32-bit …
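For concreteness, here is what 16-bit fixed point looks like in scalar form: a hypothetical Q8.8 format (8 integer bits, 8 fraction bits), where multiplication must widen to 32 bits before shifting the extra fraction bits back out. The format choice and names are illustrative.

```c
#include <stdint.h>

/* Q8.8 fixed point: value = raw / 256. Range is roughly [-128, 128)
   with ~0.004 resolution, which is the precision trade-off the
   float-to-fixed conversion buys its speed with. */
typedef int16_t q8_8;

static inline q8_8  q8_8_from_float(float f) { return (q8_8)(f * 256.0f); }
static inline float q8_8_to_float(q8_8 q)    { return (float)q / 256.0f; }

/* (a/256) * (b/256) = a*b / 65536, so the 32-bit product is shifted
   right by 8 to land back on the Q8.8 scale. */
static inline q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    return (q8_8)(((int32_t)a * (int32_t)b) >> 8);
}
```

The SIMD appeal is that `pmullw`/`pmulhw`-style instructions process 8 such values per 128-bit register versus 4 floats, but note the narrow range and the overflow handling a GEMM inner loop would need.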

performance of intrinsic functions with sse

Submitted by ∥☆過路亽.° on 2019-12-05 17:37:34
I am currently getting started with SSE. The answer to my previous question regarding SSE (Multiplying vector by constant using SSE) brought me to the idea of testing the difference between using intrinsics like _mm_mul_ps() and just using 'normal operators' (not sure what the best term is) like *. So I wrote two test cases which only differ in the way the result is calculated. Method 1:

    int main(void){
        float4 a, b, c;
        a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
        b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);
        printf("method 1\n");
        c.v = a.v + b.v; // <---
        print_vector(a);
        print_vector(b);
        printf("1…
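The operator form relies on the GCC/Clang vector extension, which allows `+` and `*` directly on `__m128`; both spellings typically compile to the same `mulps`/`addps` instruction. A self-contained sketch of the comparison (the `float4` union mirrors the one the question's code assumes):

```c
#include <emmintrin.h>

/* Union exposing the lanes of a __m128, as in the question's test code. */
typedef union { __m128 v; float f[4]; } float4;

/* Compute the product both ways and report whether the lanes agree.
   The operator form is a GCC/Clang extension, not portable ISO C. */
int compare_styles(void)
{
    float4 a, b, c1, c2;
    a.v  = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v  = _mm_setr_ps(-1.0f, -2.0f, -3.0f, -4.0f);
    c1.v = _mm_mul_ps(a.v, b.v);  /* intrinsic */
    c2.v = a.v * b.v;             /* vector-extension operator */
    for (int i = 0; i < 4; i++)
        if (c1.f[i] != c2.f[i])
            return 0;
    return 1;
}
```

Any measured difference between the two methods usually comes from surrounding code (e.g. how values are printed or reloaded), not from the arithmetic itself.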

determinant calculation with SIMD

Submitted by ▼魔方 西西 on 2019-12-05 17:18:38
Does there exist an approach for calculating the determinant of matrices with low dimensions (about 4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE all the way to SSE3, and NEON, both under Linux. The matrix elements are all floats. Here's my 5 cents:

- determinant of a 2x2 matrix: that's an exercise for the reader; it should be simple to implement.
- determinant of a 3x3 matrix: use the scalar triple product. This will require smart cross() and dot() implementations. The recipes for these are widely available.
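The 3x3 recipe above is det = r0 · (r1 × r2) with r0, r1, r2 the matrix rows. Written out scalar for clarity (the same cross/dot structure maps onto SSE shuffles or NEON directly):

```c
/* 3x3 determinant via the scalar triple product: det = r0 . (r1 x r2).
   Scalar reference version; a SIMD variant replaces cross() and dot()
   with shuffle/multiply/subtract sequences on 4-lane registers. */
typedef struct { float x, y, z; } vec3;

static vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

static float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

static float det3(vec3 r0, vec3 r1, vec3 r2)
{
    return dot(r0, cross(r1, r2));
}
```

For 4x4, expanding along one row reduces the problem to four such 3x3 determinants, or one can use the 2x2-block (Laplace expansion by complementary minors) formulation, which vectorizes better.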

How to control whether C math uses SSE2?

Submitted by 依然范特西╮ on 2019-12-05 12:32:37
I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern; here's what happens for sin. First there is a dispatch routine from a file called "disp_pentium4.inc". It checks whether the variable ___use_sse2_mathfcns has been set; if so, it calls __sin_pentium4, otherwise __sin_default. __sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 FPU to the xmm0 register, …