Trying out std::tr1::array on a Mac, I'm getting what looks like 16-byte alignment.
sizeof(int) = 4;
sizeof( std::tr1::array< int,3 > ) = 16;
sizeof
From the little data you've given, it looks like the size is being rounded up to the nearest power of two. I don't know much about CPU architecture, but my guess is that power-of-two sizes are faster to work with than unpadded ones, at least for small objects. Perhaps you should see what happens when you try to allocate something much larger?
Is there any reason you absolutely positively need to skim those extra bytes off the top?
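One way to test the power-of-two guess is to print the size, alignment, and padding for several element counts and see whether the padding keeps growing or stays bounded. This is only a rough sketch: it assumes a C++11 compiler and uses std::array in place of std::tr1::array (which newer toolchains may not ship), and padding_bytes is a made-up helper name. The numbers it prints depend on the compiler and library, so none are given here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <iostream>

// Hypothetical helper: prints the layout of std::array<int, N> and
// returns how many bytes it occupies beyond its N ints.
template <std::size_t N>
std::size_t padding_bytes() {
    typedef std::array<int, N> Arr;
    const std::size_t payload = N * sizeof(int);

    // The container can never be smaller than the elements it holds.
    assert(sizeof(Arr) >= payload);

    std::cout << "N=" << N
              << " sizeof=" << sizeof(Arr)
              << " alignof=" << alignof(Arr)
              << " padding=" << (sizeof(Arr) - payload) << '\n';
    return sizeof(Arr) - payload;
}
```

Calling padding_bytes<3>(), padding_bytes<5>(), padding_bytes<100>() and padding_bytes<1000>() from main should settle it. If the padding never exceeds alignof minus one, the size is being rounded up to a multiple of the alignment rather than to a power of two.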