How to allocate 16-byte memory-aligned data

Question

I am trying to implement SSE vectorization on a piece of code for which I need my 1D array to be 16-byte aligned. However, I have tried several ways to allocate 16-byte aligned memory, and it always ends up being 4-byte aligned.

I have to work with the Intel icc compiler. This is a sample code I am testing with:

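  #include <stdio.h>
  #include <stdlib.h>
  #include <malloc.h>   /* declares memalign() on Linux/glibc */

  void error(const char *str)
  {
    printf("Error: %s\n", str);
    exit(-1);
  }

  int main(void)
  {
    int i;
    float *A = (float *) memalign(16, 20 * sizeof(float));

    /* Alternative: POSIX allocation with the same alignment. */
    // float *A = NULL;
    // if (posix_memalign((void **)&A, 16, 20 * sizeof(float)) != 0)
    //   error("Cannot align");

    if (A == NULL)
      error("Cannot allocate");

    /* Print the address of every element. */
    for (i = 0; i < 20; i++)
      printf("&A[%d] = %p\n", i, (void *)&A[i]);

    free(A);

    return 0;
  }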

This is the output I get:

 &A[0] = 0x11fe010
 &A[1] = 0x11fe014
 &A[2] = 0x11fe018
 &A[3] = 0x11fe01c
 &A[4] = 0x11fe020
 &A[5] = 0x11fe024
 &A[6] = 0x11fe028
 &A[7] = 0x11fe02c
 &A[8] = 0x11fe030
 &A[9] = 0x11fe034
 &A[10] = 0x11fe038
 &A[11] = 0x11fe03c
 &A[12] = 0x11fe040
 &A[13] = 0x11fe044
 &A[14] = 0x11fe048
 &A[15] = 0x11fe04c
 &A[16] = 0x11fe050
 &A[17] = 0x11fe054
 &A[18] = 0x11fe058
 &A[19] = 0x11fe05c

It is 4-byte aligned every time. I have used both memalign and posix_memalign. Since I am working on Linux, I cannot use _mm_malloc, nor can I use _aligned_malloc. I get a memory corruption error when I try to use _aligned_attribute (which I think is suitable for gcc only).

Can anyone help me reliably allocate 16-byte aligned data with icc on Linux?

Answer 1:

The memory you allocate is 16-byte aligned. See:
&A[0] = 0x11fe010
But in an array of float, each element is 4 bytes, so the second element is only 4-byte aligned.

You can use an array of structures, each containing a single float, with the aligned attribute:

struct x {
    float y;
} __attribute__((aligned(16)));
struct x *A = memalign(...);

Answer 2:

The address returned by memalign is 0x11fe010, which is a multiple of 0x10, so the function is doing the right thing and your array is properly aligned on a 16-byte boundary. What you do afterwards is print the address of each successive float element. Since a float is exactly 4 bytes in your case, each address equals the previous one plus 4. For instance, 0x11fe010 + 0x4 = 0x11fe014, and 0x11fe014 is of course not a multiple of 0x10. If you were to align every float on a 16-byte boundary, you would waste 16 - 4 = 12 bytes per element. Double-check the requirements of the intrinsics you are using.
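The arithmetic above can be checked directly. The following is a minimal sketch, assuming a 4-byte float and a POSIX system; the helper names is_aligned and count_aligned_floats are made up for illustration and are not part of any library.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: allocates n floats on a 16-byte boundary and counts
   how many elements start on a 16-byte boundary themselves.
   Returns -1 if the allocation fails. */
static int count_aligned_floats(size_t n)
{
    void *raw = NULL;
    if (posix_memalign(&raw, 16, n * sizeof(float)) != 0)
        return -1;

    float *a = raw;
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_aligned(&a[i], 16))
            count++;

    free(raw);
    return count;
}
```

With 4-byte floats, only every fourth element (indices 0, 4, 8, ...) lands on a 16-byte boundary, which matches the addresses printed in the question.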

Answer 3:

AFAIK, both memalign and posix_memalign are doing their job.

&A[0] = 0x11fe010

This is aligned to 16 bytes.

&A[1] = 0x11fe014

When you write &A[1], you are telling the compiler to advance a float pointer by one position. That unavoidably leads to:

&A[0] + sizeof( float ) = 0x11fe010 + 4 = 0x11fe014

If you intend to have every element of your vector aligned to 16 bytes, you should consider declaring a structure that is 16 bytes wide:

struct float_16byte
{
    float data;
    float padding[ 3 ];
};

Then allocate memory for ELEMENT_COUNT (20, in your example) elements:

struct float_16byte *A = ( struct float_16byte * )memalign( 16, ELEMENT_COUNT * sizeof( struct float_16byte ) );
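As a rough sketch of this padded-structure approach, the code below allocates the elements with posix_memalign and verifies that each one starts on a 16-byte boundary. It assumes a 4-byte float (enforced with a static assertion); alloc_padded and all_elements_aligned are hypothetical helper names chosen for this example.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One float padded out to 16 bytes, as in the structure above. */
struct float_16byte {
    float data;
    float padding[3];
};

/* Assumption: float is 4 bytes, so the structure is exactly 16 bytes. */
static_assert(sizeof(struct float_16byte) == 16, "expected a 4-byte float");

/* Hypothetical helper: allocates `count` padded elements on a 16-byte
   boundary. Returns NULL on failure; release the result with free(). */
static struct float_16byte *alloc_padded(size_t count)
{
    void *raw = NULL;
    if (posix_memalign(&raw, 16, count * sizeof(struct float_16byte)) != 0)
        return NULL;
    return raw;
}

/* Hypothetical helper: returns 1 if every element of a[0..count-1]
   starts on a 16-byte boundary, 0 otherwise. */
static int all_elements_aligned(const struct float_16byte *a, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if ((uintptr_t)&a[i] % 16 != 0)
            return 0;
    return 1;
}
```

Note the cost: each element carries 12 bytes of padding, and the floats are no longer contiguous, so this layout is usually not what packed SSE loads expect.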

Answer 4:

I found this code on Wikipedia:

Example: get a 4 KB buffer aligned to 12 bits (a 4096-byte boundary) with malloc()

// unaligned pointer to large area
void *up=malloc((1<<13)-1);
// well aligned pointer to 4KBytes
void *ap=aligntonext(up,12);

where aligntonext() means: move p to the right until the next well-aligned address, if it is not already aligned. A possible implementation is:

// PSEUDOCODE assumes uint32_t p,bits; for readability
// --- not typesafe, not side-effect safe
#define alignto(p,bits) (((p)>>(bits))<<(bits))
#define aligntonext(p,bits) alignto(((p)+(1<<(bits))-1),(bits))
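For comparison, here is one way the pseudocode might be turned into compilable C. It is only a sketch: it assumes that converting a pointer to uintptr_t and back preserves the address (true on common platforms, but implementation-defined), and the names align_to_next and malloc_aligned are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Rounds addr up to the next multiple of 2^bits
   (addr is returned unchanged if it is already aligned). */
static uintptr_t align_to_next(uintptr_t addr, unsigned bits)
{
    assert(bits < sizeof(uintptr_t) * 8);
    uintptr_t mask = ((uintptr_t)1 << bits) - 1;
    return (addr + mask) & ~mask;
}

/* Hypothetical helper: over-allocates by 2^bits - 1 bytes so that a
   `size`-byte region aligned to 2^bits always fits inside the block.
   *base receives the original pointer, which is the one to pass to free().
   Returns NULL on allocation failure. */
static void *malloc_aligned(size_t size, unsigned bits, void **base)
{
    size_t slack = ((size_t)1 << bits) - 1;
    *base = malloc(size + slack);
    if (*base == NULL)
        return NULL;
    return (void *)align_to_next((uintptr_t)*base, bits);
}
```

The caller has to keep the unaligned base pointer around, because passing the adjusted pointer to free() is undefined behavior. That bookkeeping is what memalign and posix_memalign do for you.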

Answer 5:

I believe your code is correct and suitable for Intel SSE code. When you load data into an XMM register with an aligned load (for example movaps, or the _mm_load_ps intrinsic), the processor loads 4 contiguous floats from memory, and only the address of the first one has to be 16-byte aligned.

In short, I believe what you have done is exactly what you want.
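To illustrate, here is a small sketch that sums a 16-byte aligned float array four elements at a time with SSE intrinsics. It assumes an x86 target with SSE and the xmmintrin.h header (available with icc, gcc and clang); sum_aligned is a made-up function name, and the element count is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sums n floats from a 16-byte aligned array, four at a time.
   Assumptions: a is 16-byte aligned and n is a multiple of 4. */
static float sum_aligned(const float *a, size_t n)
{
    assert((uintptr_t)a % 16 == 0 && n % 4 == 0);

    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        /* &a[i] is 16-byte aligned because i is a multiple of 4. */
        acc = _mm_add_ps(acc, _mm_load_ps(&a[i]));
    }

    /* Add the four lanes of the accumulator together. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Only &a[0], &a[4], &a[8] and so on are ever passed to the aligned load, so a plain float array whose first element is 16-byte aligned is sufficient.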

Source: https://stackoverflow.com/questions/11084468/how-to-allocate-16byte-memory-aligned-data

Tags

memory

sse

icc