I have recently downloaded and installed the Intel C++ compiler, Composer XE 2013, for Linux which is free to use for non-commercial development. http://software.intel.com/e
Two points:
(1) It appears you are using Intel intrinsics in your code -- g++ and icpc do not necessarily implement the same intrinsics (although most of them overlap). Check which header files need to be included (g++ may need a hint, such as -mavx, before it will define the intrinsic for you). Does g++ give an error message when it fails?
(2) The compiler flags do not guarantee that the corresponding instructions will be generated (from icpc --help):
-msse3 May generate Intel(R) SSE3, SSE2, and SSE instructions
These flags are usually just hints to the compiler. You may want to look at -xHost and -fast.
It seems no matter what options I try it compiles but does not make optimal use of the AVX code.
How have you checked this? You may not see a 4x speedup if there are other bottlenecks (such as memory bandwidth).
EDIT (based on question edits):
It looks like icc scalar is faster than gcc scalar -- it is possible that icc is vectorizing the scalar code. If this is the case, I would not expect a 4x speedup from icc when manually coding the vectorization.
As for the difference between icc at 5.782332s and gcc at 3.509130s (for nvec 5000000): this is unexpected. I cannot tell from the information I have why the runtime differs between the two compilers. I would recommend looking at the emitted code (http://www.delorie.com/djgpp/v2faq/faq8_20.html) from both compilers. Also, make sure that your measurements are reproducible (e.g. memory layout on multi-socket machines, hot/cold caches, background processes, etc.).