I've read some other questions on this topic, but they didn't solve my problem. I wrote the code as follows and I got pthread version an
There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.
Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake since it does not return the wall-clock (real) time but instead the accumulated CPU time for all process threads (and on some Unix flavours even the accumulated CPU time for all child processes). Your parallel code shows better performance on Windows since there clock() returns the real time and not the accumulated CPU time.
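To make the difference concrete, here is a minimal sketch (not from your post; it assumes GCC with -fopenmp on a POSIX system and uses a dummy sine loop as the workload) in which clock() reports roughly the summed CPU time of all threads, while omp_get_wtime() reports the real elapsed time:

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>

int main()
{
    const long iters = 200000000L;   // arbitrary amount of busy work
    double sum = 0.0;

    clock_t c0 = clock();            // accumulated CPU time of all threads
    double  w0 = omp_get_wtime();    // wall-clock (real) time

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < iters; ++i)
        sum += std::sin(i * 1e-7);

    clock_t c1 = clock();
    double  w1 = omp_get_wtime();

    // With 2 threads the first number should be roughly twice the second
    printf("clock():         %lf s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("omp_get_wtime(): %lf s\n", w1 - w0);
    printf("(checksum: %g)\n", sum);  // keeps the compiler from dropping the loop
    return 0;
}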
The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():
double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);
For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:
struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
(start.tv_sec + 1.e-9 * start.tv_nsec));
The Linux scheduler, in the absence of any information, will tend to schedule the threads of a process on the same core so that they are served by the same cache and memory bus. It has no way of knowing that your threads access different memory, so it cannot tell that they would be helped rather than hurt by running on different cores.
Use the sched_setaffinity function to set each thread to a different core mask.
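For illustration only (this is not part of the original answer), a minimal sketch of pinning each OpenMP thread to its own core with sched_setaffinity() could look like the following; it assumes a Linux/glibc system, at most one thread per core, and cores numbered from 0:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // sched_setaffinity() and the CPU_* macros are GNU extensions
#endif
#include <sched.h>
#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);                 // thread i -> core i (assumed mapping)
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)   // pid 0 means "the calling thread"
            perror("sched_setaffinity");

        // ... do the parallel work here ...
    }
    return 0;
}

Note that recent OpenMP runtimes can achieve the same thing without platform-specific code via the OMP_PROC_BIND and OMP_PLACES environment variables.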
WARNING: this answer is controversial. The trick described below is implementation dependent and can decrease performance just as easily as increase it. I strongly recommend taking a look at the comments on this answer.
This doesn't really answer the question, but if you alter the way you parallelize your code, you might get a performance boost. Now you do it like this:
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
In this case each thread computes one item at a time. Since you have 2 cores, OpenMP will create two threads by default. To calculate each value a thread would have to:

1. Initialize, i.e. obtain its next piece of work from the OpenMP runtime.
2. Calculate std::sin(2 * M_PI * n / sizen).

The first step is rather expensive, and each of your two threads would have to do it sizen/2 times.
Try to do the following:
int workloadPerThread = sizen / NUM_THREADS;
#pragma omp parallel for
for (int thread = 0; thread < NUM_THREADS; ++thread)
{
    int start = thread * workloadPerThread;
    int stop  = start + workloadPerThread;
    if (thread == NUM_THREADS - 1)
        stop += sizen % NUM_THREADS;     // the last thread picks up the remainder
    for (int n = start; n < stop; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
}
This way your threads will initialize only once.