SSE

Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86-64 Intel CPUs?

断了今生、忘了曾经 submitted on 2019-12-02 22:39:40
I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps , to relax the alignment constraint and use _mm_loadu_ps . There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or unaligned load intrinsic, it runs 100 iterations through a large array, summing the elements with SSE intrinsics. The source code is here: https://gist.github.com/rmcgibbo/7689820 The results on a 64-bit Macbook

Do SSE instructions consume more power/energy?

故事扮演 submitted on 2019-12-02 22:20:19
Very simple question, probably a difficult answer: does using SSE instructions, for example for parallel sum/min/max/average operations, consume more power than doing the same work with other instructions (e.g. a sequence of scalar sums)? For example, on Wikipedia I couldn't find any information in this respect. The only hint of an answer I could find is here, but it's a little generic and there is no reference to any published material. Mysticial: I actually did a study on this a few years ago. The answer depends on what exactly your question is: in today's processors, power consumption is not much

Why does _mm_stream_ps produce L1/LL cache misses?

こ雲淡風輕ζ submitted on 2019-12-02 21:15:47
I'm trying to optimize a computation-intensive algorithm and am kind of stuck on a cache problem. I have a huge buffer which is written occasionally and at random, and read only once at the end of the application. Obviously, writing into the buffer produces lots of cache misses and, besides, pollutes the caches which are afterwards needed again for computation. I tried to use non-temporal move intrinsics, but the cache misses (reported by valgrind and supported by runtime measurements) still occur. However, to further investigate non-temporal moves, I wrote a little test program, which you

Aligned types and passing arguments by value

吃可爱长大的小学妹 submitted on 2019-12-02 20:57:17
Passing aligned types, or structures containing aligned types, by value doesn't work with some implementations. This breaks STL containers, because some of their methods (such as resize) take arguments by value. I ran some tests with Visual Studio 2008 and am not entirely sure when and how pass-by-value fails. My main concern is the function foo . It seems to work fine, but could that be a result of inlining or some other coincidence? What if I change its signature to void foo(const __m128&) ? Your input is greatly appreciated. Thank you. struct A { __m128 x; int n; }; void foo(__m128); void bar(A); void

Should I use SIMD or vector extensions or something else?

半世苍凉 submitted on 2019-12-02 20:41:09
I'm currently developing an open-source 3D application framework in C++ (with C++11). My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question. Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is on. I can also use the " vector extension " of GCC. But none of this is really portable. I know that I have more control when I use my own SSE code, but often this

Can't get over 50% max. theoretical performance on matrix multiply

两盒软妹~` submitted on 2019-12-02 19:13:11
Problem: I am learning about HPC and code optimization. I am attempting to replicate the results in Goto's seminal matrix multiplication paper ( http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf ). Despite my best efforts, I cannot get over ~50% of maximum theoretical CPU performance. Background: see related issues here ( Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD ), including info about my hardware. What I have attempted: this related paper ( http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf ) has a good description of Goto's algorithmic structure.

Is my understanding of AoS vs SoA advantages/disadvantages correct?

删除回忆录丶 submitted on 2019-12-02 18:23:31
I've recently been reading about AoS vs SoA structure design and data-oriented design. It's oddly difficult to find information about either, and what I have found seems to assume a greater understanding of processor functionality than I possess. That said, what I do understand about the former topic in particular leads to some questions that I think I should be able to understand the answers to. First, to make sure I am not basing my understanding on a false premise, here is my understanding of the functionality and pros and cons of AoS vs SoA, as applied to a collection of 'Person' records with

How to load unsigned ints into SIMD

╄→尐↘猪︶ㄣ submitted on 2019-12-02 17:09:20
Question: I have a C program where I have a few arrays of unsigned ints. I'm using the uint32_t declaration. I want to use SIMD to perform some operations on the data stored in each of the arrays. This is where I'm stuck, because it looks like most of the SSE and SSE2 functions only support float and double. What's the best way for me to load data of type uint32_t ? Answer 1: For any integer SSE type you typically use _mm_load_si128 / _mm_loadu_si128 : uint32_t a[N]; __m128i v = _mm_loadu_si128((__m128i *

Common SIMD techniques

好久不见. submitted on 2019-12-02 16:22:27
Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful: it can handle complex conditional code without branches. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of the corresponding bytes of Ra and Rb: USUB8 Rd, Ra, Rb SEL Rd, Rb, Ra Links to tutorials / uncommon SIMD techniques are good too :) ARMv6 is the most interesting for me, but x86 (SSE, ...) / NEON (in ARMv7) / others are good too. One of the best SIMD resources ever was

custom extended vector type: e.g. float4 b = v.xxyz;

我只是一个虾纸丫 submitted on 2019-12-02 15:53:06
Question: OpenCL, GCC, and Clang have convenient vector type extensions. One of the features I like best is the ability to do a swizzle like this: float4 a(1,2,3,4); float4 b = a.xxyw; //1124 How can I make my own vector extensions do this with e.g. MSVC as well? The best I have come up with is something that does float4 b = a.xxyw() (see the code below). So my main question is how it would be possible to do this without the () notation. In case anyone is interested, I came up with some code