I want to repeatedly zero a large 2d array in C. This is what I do at the moment:
// Array of size n * m, where n may not equal m
for(j = 0; j < n; j++)
{
If you are really, really obsessed with speed (and not so much with portability), I think the absolute fastest way to do this would be to use SIMD vector intrinsics. For example, on x86 CPUs (Intel or AMD) you could use these SSE2 intrinsics:
__m128i _mm_setzero_si128 (); // Create a 128-bit value with all bits set to 0.
void _mm_storeu_si128 (__m128i *p, __m128i a); // Write 128 bits to the specified address (no alignment required).
Each store instruction will set four 32-bit ints to zero in one hit.
_mm_storeu_si128 itself does not require p to be aligned, but keeping p 16-byte aligned is still good for speed: no store then straddles a cache line, and you can switch to the aligned store _mm_store_si128. The other restriction is that the allocation size must be a multiple of 16 bytes, because each store writes a full 16 bytes. This is fine too, because it lets us unroll the loop easily.
Put the store in a loop, unroll the loop a few times, and you will have a crazy fast initialiser:
// Assumes int is 32 bits. Requires #include <emmintrin.h> (SSE2).
const int mr = roundUpToNearestMultiple(m, 4); // This isn't the optimal modification of m and n, but done this way here for clarity.
const int nr = roundUpToNearestMultiple(n, 4);
int i = 0;
int array[mr][nr] __attribute__ ((aligned (16))); // GCC directive.
__m128i* px = (__m128i*)array;
const int s = mr * nr; // Total number of ints; a multiple of 16 because mr and nr are both multiples of 4.
const int incr = 16; // Unrolled 4 times: each pass stores 4 vectors of 4 ints.
const __m128i zero128 = _mm_setzero_si128();
for(i = 0; i < s; i += incr)
{
    _mm_storeu_si128(px++, zero128);
    _mm_storeu_si128(px++, zero128);
    _mm_storeu_si128(px++, zero128);
    _mm_storeu_si128(px++, zero128);
}
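For something you can compile on its own, here is a minimal sketch of the same unrolled loop wrapped in a function. The names round_up and zero_ints_sse2 are made up for illustration (round_up stands in for roundUpToNearestMultiple above). It assumes a 32-bit int, a 16-byte-aligned buffer and an element count that is a multiple of 16. Because alignment is asserted, it uses the aligned store _mm_store_si128 rather than _mm_storeu_si128.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round x up to the next multiple of k (k > 0). */
static size_t round_up(size_t x, size_t k)
{
    return (x + k - 1) / k * k;
}

/* Zero `count` ints starting at p.
   Assumptions: int is 32 bits, p is 16-byte aligned,
   and count is a multiple of 16 (4 stores of 4 ints per pass). */
static void zero_ints_sse2(int *p, size_t count)
{
    assert(sizeof(int) == 4);
    assert(((uintptr_t)p & 15) == 0);
    assert(count % 16 == 0);

    __m128i *px = (__m128i *)p;
    const __m128i zero128 = _mm_setzero_si128();

    for (size_t i = 0; i < count; i += 16) {
        /* Unrolled 4 times: each store clears four ints. */
        _mm_store_si128(px++, zero128);
        _mm_store_si128(px++, zero128);
        _mm_store_si128(px++, zero128);
        _mm_store_si128(px++, zero128);
    }
}
```

You would call it as zero_ints_sse2(&array[0][0], mr * nr) on an array declared with the aligned attribute. It is worth timing this against a plain memset before committing to it, since I have not measured which is faster on any particular compiler or CPU.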
There is also a non-temporal store, _mm_stream_si128, that bypasses the cache (i.e. zeroing the array won't pollute the cache), which could give you some secondary performance benefits in some circumstances. Unlike _mm_storeu_si128, it requires p to be 16-byte aligned.
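A rough sketch of the cache-bypassing version, using the SSE2 streaming store _mm_stream_si128. The function name zero_ints_stream is hypothetical. It assumes a 32-bit int, a 16-byte-aligned pointer (the streaming store requires this) and a count that is a multiple of 4. The trailing _mm_sfence orders the streaming stores before any stores that follow.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Zero `count` ints at p with non-temporal (cache-bypassing) stores.
   Assumptions: int is 32 bits, p is 16-byte aligned,
   and count is a multiple of 4. */
static void zero_ints_stream(int *p, size_t count)
{
    assert(sizeof(int) == 4);
    assert(((uintptr_t)p & 15) == 0);  /* required by _mm_stream_si128 */
    assert(count % 4 == 0);

    __m128i *px = (__m128i *)p;
    const __m128i zero128 = _mm_setzero_si128();

    for (size_t i = 0; i < count; i += 4)
        _mm_stream_si128(px++, zero128);

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

This is most likely to help when the array is much larger than the cache and you won't read it back straight away. If you are going to touch the zeroed data immediately, the ordinary stores may well be the better choice.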
See here for SSE2 reference: http://msdn.microsoft.com/en-us/library/kcwz153a(v=vs.80).aspx