I have what seems to be a very simple parallel for loop, which is just writing zeros to an integer array. But it turns out that the more threads I use, the slower the loop runs.
Besides your error in using the clock function on Linux (clock() returns CPU time summed over all threads, not wall time; use omp_get_wtime() instead), the rest of your question can be answered by reading these questions/answers:
in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987
measuring-memory-bandwidth-from-the-dot-product-of-two-arrays
memset-in-parallel-with-threads-bound-to-each-physical-core
So you should see a significant benefit from multiple threads with memset and with doing a reduction, even on a single-socket system. I have written my own tool to measure bandwidth for this. You can find some of the results from my i5-4250U (Haswell) with 2 cores below (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.
one thread
C standard library
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.80 6.68 0.00 inf %
memcpy: 1.00 1.35 7.93 0.00 inf %
Agner Fog's asmlib
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.71 7.53 0.00 inf %
memcpy: 1.00 0.93 11.51 0.00 inf %
my_memset
0.50 0.71 7.53 0.00 inf %
FMA3 reduction tests
GB time(s) GB/s GFLOPS efficiency
vsum: 0.50 0.53 10.08 2.52 inf %
vmul: 0.50 0.68 7.93 1.98 inf %
vtriad: 0.50 0.70 7.71 3.85 inf %
dot 1.00 1.08 9.93 2.48 inf %
two threads
C standard library
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.64 8.33 0.00 inf %
memcpy: 1.00 1.10 9.76 0.00 inf %
Agner Fog's asmlib
GB time(s) GB/s GFLOPS efficiency
memset: 0.50 0.36 14.98 0.00 inf %
memcpy: 1.00 0.66 16.30 0.00 inf %
my_memset
0.50 0.36 15.03 0.00 inf %
FMA3 tests
standard sum tests with OpenMP: 2 threads
GB time(s) GB/s GFLOPS efficiency
vsum: 0.50 0.41 13.03 3.26 inf %
vmul: 0.50 0.39 13.67 3.42 inf %
vtriad: 0.50 0.44 12.20 6.10 inf %
dot 1.00 0.97 11.11 2.78 inf %
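For reference, the GB/s numbers above are just bytes touched divided by elapsed wall time. Here is a minimal sketch of that kind of measurement (an illustration only, not the tool that produced the tables above; compile with -fopenmp):

#include <stdio.h>
#include <string.h>
#include <vector>
#include <omp.h>

int main() {
    const size_t n = size_t(1) << 30;     // 1 GB buffer (illustrative size)
    std::vector<char> buf(n, 1);
    double t = -omp_get_wtime();          // wall-clock time, unlike clock() on Linux
    memset(buf.data(), 0, n);
    t += omp_get_wtime();
    printf("memset: %.3f s, %.2f GB/s\n", t, n / t * 1E-9);
    return 0;
}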
Here is my custom memset function (I have several other tests like this).
void my_memset(int *s, int c, size_t n) {
int i;
__m128i v = _mm_set1_epi32(c);
#pragma omp parallel for
for(i=0; i<n/4; i++) {
_mm_stream_si128((__m128i*)&s[4*i], v);  // non-temporal (streaming) store of four ints
}
}
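Note that _mm_stream_si128 requires a 16-byte aligned address and the loop assumes n is a multiple of 4. A small usage sketch that satisfies both assumptions (a hypothetical driver, not part of my benchmark) would be:

#include <stdio.h>
#include <x86intrin.h>

void my_memset(int *s, int c, size_t n);  // the function shown above

int main() {
    const size_t n = 1 << 24;                        // multiple of 4
    int *a = (int*)_mm_malloc(n * sizeof(int), 16);  // 16-byte aligned allocation
    my_memset(a, 0, n);
    printf("a[0] = %d, a[n-1] = %d\n", a[0], a[n - 1]);
    _mm_free(a);
    return 0;
}

In the full test code further down, vec.data() from std::vector is in practice sufficiently aligned with glibc's allocator on x86-64, but _mm_malloc makes the requirement explicit.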
Edit:
You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating point arithmetic is not associative, and vectorizing the loop could break IEEE floating point rules. Using -ffast-math lets GCC treat floating point arithmetic as associative, which allows it to vectorize the reduction. It should be pointed out that doing a reduction in OpenMP already assumes the floating point arithmetic is associative, so it already breaks IEEE floating point rules.
double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) {
#pragma omp parallel for reduction(+:sum)
for(int i = 0; i < sz_i; ++ i)
sum += ptr[i];
}
toc();
printf("sum %f\n", sum);
Edit:
I tested your code and made some modifications. I get faster times with the reduction and memset using multiple threads:
max threads: 4
serial reduction
dtime 1.86, sum 705032704
parallel reduction
dtime 1.39 s, sum 705032704
serial memset
dtime 2.95 s
parallel memset
dtime 2.44 s
serial my_memset
dtime 2.66 s
parallel my_memset
dtime 1.35 s
Here is the code I used (g++ foo.cpp -fopenmp -O3 -ffast-math)
#include <stdio.h>
#include <iostream>
#include <vector>
#include <omp.h>
#include <x86intrin.h>
void my_memset(int *s, int c, size_t n) {
int i;
__m128i v = _mm_set1_epi32(c);
for(i=0; i<n/4; i++) {
_mm_stream_si128((__m128i*)&s[4*i], v);
}
}
void my_memset_omp(int *s, int c, size_t n) {
int i;
__m128i v = _mm_set1_epi32(c);
#pragma omp parallel for
for(i=0; i<n/4; i++) {
_mm_stream_si128((__m128i*)&s[4*i], v);
}
}
int main() {
const size_t sz = 100000000;  // reconstructed value (multiple of 4, fits in int); adjust as needed
const int COUNT = 50;         // reconstructed value; adjust as needed
std::vector<int> vec(sz, 1);
std::cout << "max threads: " << omp_get_max_threads()<< std::endl;
std::cout << "serial reduction" << std::endl;
double dtime;
int sum;
dtime = -omp_get_wtime();
sum = 0;
for(int c = 0; c < COUNT; ++ c) {
for(size_t i = 0; i < sz; ++ i)
sum += vec[i];
}
dtime += omp_get_wtime();
printf("dtime %.2f, sum %d\n", dtime, sum);
int *const ptr = vec.data();
const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int
std::cout << "parallel reduction" << std::endl;
dtime = -omp_get_wtime();
sum = 0;
for(int c = 0; c < COUNT; ++ c) {
#pragma omp parallel for default(none) reduction(+:sum)
for(int i = 0; i < sz_i; ++ i)
sum += ptr[i];
}
dtime += omp_get_wtime();
printf("dtime %.2f s, sum %d\n", dtime, sum);
std::cout << "serial memset" << std::endl;
dtime = -omp_get_wtime();
for(int c = 0; c < COUNT; ++ c) {
for(size_t i = 0; i < sz; ++ i)
vec[i] = 0;
}
dtime += omp_get_wtime();
printf("dtime %.2f s\n", dtime);
std::cout << "parallel memset" << std::endl;
dtime = -omp_get_wtime();
for(int c = 0; c < COUNT; ++ c) {
#pragma omp parallel for default(none)
for(int i = 0; i < sz_i; ++ i)
ptr[i] = 0;
}
dtime += omp_get_wtime();
printf("dtime %.2f s\n", dtime);
std::cout << "serial my_memset" << std::endl;
dtime = -omp_get_wtime();
for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);
dtime += omp_get_wtime();
printf("dtime %.2f s\n", dtime);
std::cout << "parallel my_memset" << std::endl;
dtime = -omp_get_wtime();
for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
dtime += omp_get_wtime();
printf("dtime %.2f s\n", dtime);
return 0;
}
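To compare thread counts, run the resulting binary with different values of OMP_NUM_THREADS (e.g. OMP_NUM_THREADS=1 and OMP_NUM_THREADS=4); pinning the threads with OMP_PROC_BIND=true can also make the parallel timings more repeatable.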