Question
I'm trying to make this program run with multiple threads.
#include <stdio.h>
#include <time.h>
#include <omp.h>

#define NUM_THREADS 4

static long num_steps = 1000000000;

int main()
{
    int i;
    double x, pi, sum = 0.0;
    double step = 1.0/(double)num_steps;
    clock_t start = clock(), diff;

    #pragma omp parallel for num_threads(NUM_THREADS) reduction (+:sum)
    for (i = 0; i < num_steps; i++)
    {
        x = (i+0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    #pragma omp ordered
    pi = step*sum;

    printf("pi = %.15f\n %d iterations\n", pi, num_steps);
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds", msec/1000, msec%1000);
    return 0;
}
by adding the #pragma omp parallel for num_threads(NUM_THREADS) reduction (+:sum). I also have #pragma omp ordered after the for loop, which I don't think I actually need, because no thread should continue until all threads are done with the for loop. Is this correct? Would this also be the reason why I only get about a one-second improvement compared to running this as a single-threaded program? It's 6 seconds compared to 7 seconds for me.
What I can't figure out is why this program gives me a different answer for pi every time I run it.
Answer 1:
Aside from the bugs pointed out by Gilles, there's a more fundamental issue here.
Reduction across parallel threads need not be deterministic: the order in which the per-thread contributions are combined can change with each execution of the code. If you don't know why that matters, please go and read "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
In case you didn't get the point yet, consider three threads doing a sum reduction on a decimal arithmetic machine that supports three digits of precision. Suppose we're accumulating the set (100, -100, 0.1). If we add them in that order, we get (100 - 100) + 0.1 = 0 + 0.1 = 0.1. However, if we add them in the order (100, 0.1, -100), we get (100 + 0.1) - 100 = 100 - 100 = 0 (three significant figures, remember!).
If you're using the Intel compiler, there is an environment variable you can set to request deterministic reductions (KMP_DETERMINISTIC_REDUCTION). However, that only enforces determinism when the same number of threads is used; it does not enforce it between runs with different numbers of threads. (Doing that would require enforcing an order on the accumulation of the per-thread contributions, which would require different code generation and some inter-thread synchronization.)
Answer 2:
Your problem comes from the fact that you forgot to declare x private. If you change your OpenMP directive into:

    #pragma omp parallel for num_threads(NUM_THREADS) reduction(+:sum) private(x)

your code becomes valid.
However, there are still two issues in here:

- The #pragma omp ordered makes no sense, since you are not inside a parallel region. You should remove it.
- Using clock() for measuring time in multi-threaded code is hazardous, not because the function isn't thread-safe, but because it returns the CPU time of the current thread and its children, not the elapsed wall-clock time. Therefore you often end up with results that are almost identical with and without OpenMP enabled, and people wondering why their code doesn't show any speed-up... So unless you have a very good reason to use clock(), use omp_get_wtime() instead.
Source: https://stackoverflow.com/questions/33193620/why-is-this-openmp-program-giving-me-different-answers-every-time