Is it a good idea to vectorize code? What are good practices for deciding when to do it? What happens under the hood?
Maybe also have a look at libSIMDx86 (source code).
A nice, well-explained example is:
Choosing to Avoid Branches: A Small Altivec Example
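The branch-avoidance idea can be illustrated with a minimal sketch in plain C. This is not the article's code. The article uses Altivec, and the names `clamp_branchy` and `clamp_branchless` here are hypothetical. The sketch replaces a per-element `if` with an all-ones/all-zeros mask and a bitwise blend. That is the scalar analogue of a SIMD compare followed by a select, such as Altivec's `vec_sel`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: clamp each element to at most `limit`.
   Branchy version: a data-dependent if for every element. */
void clamp_branchy(const int32_t *in, int32_t *out, size_t n, int32_t limit) {
    for (size_t i = 0; i < n; i++) {
        if (in[i] > limit)
            out[i] = limit;
        else
            out[i] = in[i];
    }
}

/* Branchless version: compute a mask that is -1 (all bits set) when the
   condition holds and 0 otherwise, then blend the two candidates with it. */
void clamp_branchless(const int32_t *in, int32_t *out, size_t n, int32_t limit) {
    for (size_t i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(in[i] > limit); /* 1 -> -1, 0 -> 0 */
        out[i] = (limit & mask) | (in[i] & ~mask); /* select without a branch */
    }
}
```

With optimization enabled, a compiler may turn the branchless loop into vector compare-and-blend instructions. It may also handle the branchy loop on its own, for example with conditional moves. The outcome depends on the compiler and target, so inspecting the generated assembly is the reliable check.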