I am new to optimizing code with SSE/SSE2 instructions, and so far I have not gotten very far. To my knowledge, a common SSE-optimized function would look like this:
How about:
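/* needs <stdint.h> for uintptr_t and <stdlib.h> for malloc */
void *mem = malloc(1024 + 15);  /* over-allocate by 15 bytes so there is room to round up */
void *ptr = (void *)(((uintptr_t)mem + 15) & ~(uintptr_t)15);  /* round the address (not the pointed-to byte) up to a multiple of 16; pass mem, not ptr, to free() */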
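One catch with the trick above is that you have to keep `mem` around so you can free it later. A minimal sketch of a wrapper that hides this is below; `malloc16` and `free16` are hypothetical names, not part of any library, and the code assumes a platform where converting a pointer to `uintptr_t` and back behaves like ordinary address arithmetic (true on common x86 toolchains, but not guaranteed by the C standard). It also does not guard against `size` being so large that the addition overflows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a 16-byte-aligned block of `size` bytes,
   or NULL on failure. The pointer that malloc returned is stored just
   before the aligned block so that free16() can recover it. */
static void *malloc16(size_t size)
{
    /* 15 bytes of slack for rounding up, plus room for the saved pointer. */
    void *raw = malloc(size + 15 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip past the slot for the saved pointer, then round up to 16. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + 15) & ~(uintptr_t)15;

    /* Remember the original pointer right in front of the aligned block. */
    ((void **)aligned)[-1] = raw;

    assert((aligned & 15) == 0);
    return (void *)aligned;
}

/* Hypothetical counterpart: frees a block obtained from malloc16(). */
static void free16(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

Usage would be something like `float *buf = malloc16(1024);` followed by `free16(buf);` when done. If you would rather not roll your own, compilers that ship the SSE intrinsics headers generally provide `_mm_malloc(size, 16)` and `_mm_free`, POSIX has `posix_memalign`, and C11 has `aligned_alloc`.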