Is there a simple tutorial for me to get up to speed in SSE, SSE2 and SSE3 in GNU C++? How can you do code optimization in SSE?
MSDN has a pretty good description of the SSE compiler built-ins (intrinsics), and those built-ins are a de facto standard: they also work in GCC and in clang/Xcode.
The nice thing about that reference is that it shows equivalent pseudocode for each intrinsic, so, for example, you can learn that the ADDPD instruction computes:
r0 := a0 + b0
r1 := a1 + b1
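To connect that pseudocode to GNU C++, here is a minimal sketch using the `_mm_add_pd` intrinsic, which compiles to ADDPD. The function name `add_pairs` is made up for illustration; the intrinsics come from `<emmintrin.h>` (SSE2). On 32-bit x86 you may need to pass `-msse2` to g++, while on x86-64 SSE2 is enabled by default.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_add_pd, _mm_loadu_pd, _mm_storeu_pd

// Hypothetical helper: adds two pairs of doubles with one packed add.
// out[0] = a[0] + b[0], out[1] = a[1] + b[1]
void add_pairs(const double* a, const double* b, double* out) {
    __m128d va = _mm_loadu_pd(a);     // load a0, a1 (no alignment required)
    __m128d vb = _mm_loadu_pd(b);     // load b0, b1
    __m128d vr = _mm_add_pd(va, vb);  // r0 := a0 + b0, r1 := a1 + b1
    _mm_storeu_pd(out, vr);           // store r0, r1
}
```

Calling `add_pairs` on {1.0, 2.0} and {10.0, 20.0} fills `out` with {11.0, 22.0}. The unaligned load/store variants are used here so the sketch works with any pointer; if your data is 16-byte aligned, `_mm_load_pd` and `_mm_store_pd` are the aligned equivalents.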
And here's a good description of the cryptic shuffle instruction: http://www.songho.ca/misc/sse/sse.html
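As a rough illustration of why the shuffle is considered cryptic, here is a small sketch that reverses four floats with `_mm_shuffle_ps` (SHUFPS). The function name `reverse4` is hypothetical; the intrinsic and the `_MM_SHUFFLE` macro come from `<xmmintrin.h>` (SSE).

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_shuffle_ps, _MM_SHUFFLE

// Hypothetical helper: out = { in[3], in[2], in[1], in[0] }
void reverse4(const float* in, float* out) {
    __m128 v = _mm_loadu_ps(in);  // load in[0..3] (no alignment required)

    // _MM_SHUFFLE(z, y, x, w) builds the 8-bit selector.
    // Result lanes 0 and 1 are picked from the FIRST operand (indices w and x),
    // result lanes 2 and 3 are picked from the SECOND operand (indices y and z).
    // Passing v twice and selecting 3, 2, 1, 0 for lanes 0..3 reverses the vector.
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));

    _mm_storeu_ps(out, r);        // store the reversed lanes
}
```

The selector must be a compile-time constant, and the macro lists the lane indices from the highest result lane down to the lowest, which is the part that usually trips people up.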