I am new to optimizing code with SSE/SSE2 instructions, and so far I have not gotten very far. As far as I know, a typical SSE-optimized function looks like this:
EDIT: Casting to long is a cheap way to guard against the most likely problem nowadays, namely int and pointers having different sizes.
As pointed out in the comments below, there are better solutions if you are willing to include a header...
A pointer p is aligned on a 16-byte boundary iff ((unsigned long)p & 15) == 0.
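For illustration, here is a minimal sketch of that check in C. It uses uintptr_t from <stdint.h>, which is one header-based option and may or may not be the one the comments had in mind. Unlike unsigned long, uintptr_t is defined to be wide enough to hold a pointer value, including on platforms where long is narrower than a pointer (64-bit Windows, for example). The helper name is_aligned_16 is made up for this example and does not come from the original post.

```c
#include <stdint.h>  /* uintptr_t */

/* Hypothetical helper (the name is an assumption, not from the post):
   returns nonzero if p lies on a 16-byte boundary, zero otherwise. */
static int is_aligned_16(const void *p)
{
    /* Convert the pointer to an unsigned integer wide enough to hold it,
       then test the low four bits. An address is a multiple of 16
       exactly when those four bits are all zero. */
    return ((uintptr_t)p & 15u) == 0;
}
```

A caller could use this to choose between an aligned load such as _mm_load_ps and an unaligned one such as _mm_loadu_ps, since the aligned variants require a 16-byte-aligned address.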