Question
I tried using OpenMP to parallelise some for-loops in my program but failed to get a significant speed improvement (I actually observe a slowdown). My target machine will have 4-6 cores, and I currently rely on the OpenMP runtime to choose the thread count for me, so I haven't tried any thread-count combinations yet.
- Target/Development platform: Windows 64-bit
- using MinGW64 4.7.2 (rubenvb build)
Sample output with OpenMP
Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612 // <- returned by omp_get_wtime()
5.627 (sec) // <- returned by clock()
Wall time elapsed: 5.62703
Sample output without OpenMP
2.415 (sec) // <- returned by clock()
Wall time elapsed: 2.415
How I measure the time
struct timeval start, end;
gettimeofday(&start, NULL);
#ifdef _OPENMP
double t1 = (double) clock();
double wt = omp_get_wtime();
sim->resetEnvironment(run);
tout << omp_get_wtime() - wt << std::endl;
timeEnd(tout, t1);
#else
double t1 = (double) clock();
sim->resetEnvironment(run);
timeEnd(tout, t1);
#endif
gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
<< ((end.tv_sec - start.tv_sec) * 1000000u + (end.tv_usec - start.tv_usec)) / 1.e6
<< std::endl;
The code
void Simulator::resetEnvironment(int run)
{
#pragma omp parallel
{
// (a)
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
reset(vector_1[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
reset(vector_2[i]);
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
reset(vector_3[i]);
for (int level = 0; level < level_count; level++) // (b) level = 3
{
#pragma omp for schedule(dynamic)
for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
reset(vector_4[level][i]);
}
#pragma omp for schedule(dynamic)
for (long i = 0; i < populationSize; i++) // size ~7M
resetAgent(agents[i]);
} // end #parallel
} // end: Simulator::resetEnvironment()
Randomness
Inside the reset() function calls, I use an RNG to seed some agents for subsequent tasks. Below is my RNG implementation; I saw a suggestion to use one RNG per thread for thread safety.
class RNG {
public:
typedef std::mt19937 Engine;
RNG()
: real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
, engines()
#endif
{
#ifdef _OPENMP
int threads = std::max(1, omp_get_max_threads());
for (int seed = 0; seed < threads; ++seed)
engines.push_back(Engine(seed));
#else
engine_.seed(time(NULL));
#endif
} // end_ctor(RNG)
/** @return next value drawn from the uniform distribution on [0, 1) */
double operator()()
{
#ifdef _OPENMP
return real_uni_dist_(engines[omp_get_thread_num()]);
#else
return real_uni_dist_(engine_);
#endif
}
private:
std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
std::vector<Engine> engines;
#else
std::mt19937 engine_;
#endif
}; // end_class(RNG)
Question:
- At (a), is it a good idea not to use the shortcut 'parallel for', in order to avoid the overhead of creating teams?
- Which part of my implementation could be causing the performance degradation?
- Why are the times reported by clock() and omp_get_wtime() so similar? I expected clock() to report a longer time than omp_get_wtime().
[Edit]
- At (b), my intention in putting the OpenMP directive on the inner loop is that the outer loop has so few iterations (only 3) that I thought I could skip it and go directly to parallelising the inner loop over vector_4[level]. Is this thinking wrong? Or will it make OpenMP repeat the outer loop 4 times, so that the inner loop actually runs 12 times instead of 3 (say the current thread count is 4)?
Thanks
Answer 1:
If the measured wall-clock time (as reported by omp_get_wtime()) is close to the total CPU time (as reported by clock()), this could mean several different things:
- the code is running single-threaded, but then the total CPU time will be lower than the wall-clock time;
- a very high synchronisation and cache coherency overhead is present and it is huge in comparison to the actual work being done by the threads.
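To compare the two clocks on your own build, a minimal sketch along these lines may help. Timing, measure() and cpu_to_wall_ratio() are hypothetical names introduced here, and std::chrono::steady_clock stands in for omp_get_wtime() so that the snippet also compiles without OpenMP. Note that the Microsoft C runtime documents clock() as returning elapsed wall-clock time rather than CPU time, so on a MinGW build the two timers may agree for that reason alone; that is worth checking before reading much into the ratio.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper type: both measurements for one call.
struct Timing {
    double cpu_seconds;   // from std::clock()
    double wall_seconds;  // from std::chrono::steady_clock
};

// Runs work() once and records both clocks around it.
template <typename Work>
Timing measure(Work work)
{
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    work();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(c1 - c0) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(w1 - w0).count();
    return t;
}

// CPU time divided by wall time; returns 0 if no wall time elapsed.
inline double cpu_to_wall_ratio(const Timing& t)
{
    assert(t.wall_seconds >= 0.0);
    return t.wall_seconds > 0.0 ? t.cpu_seconds / t.wall_seconds : 0.0;
}
```

Usage would be something like measure([&] { sim->resetEnvironment(run); }), followed by printing both fields and the ratio.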
Your case is the second one (high synchronisation overhead), and the reason is that you use schedule(dynamic). Dynamic scheduling should only be used when each iteration can take a varying amount of time. If such iterations are statically distributed among the threads, work imbalance could occur. schedule(dynamic) takes care of this by giving each task (in your case each single iteration of the loop) to the next thread that finishes its work and becomes idle. There is a certain overhead in synchronising the threads and bookkeeping the distribution of the work items, and therefore it should only be used when the amount of work per thread is huge in comparison to the overhead. OpenMP allows you to group more iterations into iteration blocks, and this number is specified like schedule(dynamic,100): this would make each thread execute a block (or chunk) of 100 consecutive iterations before asking for a new one. The default block size for dynamic scheduling is 1, i.e. each vector element is handed out to a thread separately. I have no idea how much processing is done in reset() and what kind of elements there are in vector_*, but given the serial run time it is not much at all.
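As a minimal sketch of the chunked variant: reset() below is a hypothetical stand-in, since the real one is not shown in the question, and reset_all_chunked() is a name made up for illustration. Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's reset().
inline void reset(double& x) { x = 0.0; }

// Hands out iterations in blocks of `chunk` instead of one at a time,
// so the scheduling cost is paid once per block rather than per element.
void reset_all_chunked(std::vector<double>& v, int chunk)
{
    assert(chunk > 0);  // OpenMP requires a positive chunk size
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for schedule(dynamic, chunk)
    for (long i = 0; i < n; i++)
        reset(v[i]);
}
```

A call such as reset_all_chunked(v, 100) corresponds to schedule(dynamic,100); the right value has to be found by measurement.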
Another source of slowdown is the loss of data locality when you use dynamic scheduling. Depending on the type of elements of those vectors, processing neighbouring elements by different threads leads to false sharing. That means that, e.g. vector_1[i] lies in the same cache line with some other elements of vector_1, e.g. vector_1[i-1] and vector_1[i+1]. When thread 1 modifies vector_1[i], the cache line is reloaded in all other cores that work on the neighbouring elements. If vector_1[] is only written to, the compiler can be smart enough to generate non-temporal stores (those bypass the cache), but it only works with vector stores, and having each core do a single iteration at a time means no vectorisation at all. Data locality can be improved by either switching to static scheduling or, if reset() really takes a varying amount of time, by setting a reasonable chunk size in the schedule(dynamic) clause. The best chunk size is usually dependent on the processor, and often one has to tweak it in order to get the best performance.
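To get a feel for how many neighbouring elements end up in one cache line, here is a small arithmetic sketch. It assumes 64-byte cache lines (common on x86, but not universal) and an array whose first element is cache-line aligned; kCacheLine and both function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// How many whole elements of the given size fit in one cache line
// (0 if a single element is larger than a line).
inline std::size_t elements_per_line(std::size_t elem_size)
{
    assert(elem_size > 0);
    return kCacheLine / elem_size;
}

// True if elements i and j of a line-aligned array start in the same line.
inline bool same_cache_line(std::size_t i, std::size_t j, std::size_t elem_size)
{
    assert(elem_size > 0);
    return (i * elem_size) / kCacheLine == (j * elem_size) / kCacheLine;
}
```

With 8-byte elements, for example, indices 0 to 7 share one line, so with a chunk size of 1 up to eight threads could be writing into the same line at once.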
So I would strongly suggest that you first switch to static scheduling by replacing all schedule(dynamic) with schedule(static) and then try to optimise further. You don't have to specify the chunk size in the static case, as the default is simply the total number of iterations divided by the number of threads, i.e. each thread would get one contiguous block of iterations.
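The contiguous-block behaviour can be checked with a sketch like the following, which records which thread ran each iteration under schedule(static). static_owners() and one_block_per_thread() are hypothetical helper names; in a build without OpenMP every iteration is simply attributed to thread 0.

```cpp
#include <cstddef>
#include <set>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Records which thread executed each iteration under schedule(static).
std::vector<int> static_owners(std::size_t n)
{
    std::vector<int> owner(n, 0);
    const long count = static_cast<long>(n);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; i++) {
#ifdef _OPENMP
        owner[i] = omp_get_thread_num();
#else
        owner[i] = 0;  // serial build: a single "thread"
#endif
    }
    return owner;
}

// True if every thread id occupies a single contiguous run of indices.
bool one_block_per_thread(const std::vector<int>& owner)
{
    std::set<int> finished;  // thread ids whose block has already ended
    for (std::size_t i = 0; i < owner.size(); i++) {
        if (i > 0 && owner[i] != owner[i - 1])
            finished.insert(owner[i - 1]);
        if (finished.count(owner[i]) != 0)
            return false;  // a thread id reappeared after its block ended
    }
    return true;
}
```

Running the same check with schedule(dynamic) would typically show thread ids interleaved across the index range, which is the locality loss described above.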
Answer 2:
To answer your questions:
1) In (a), your use of the "parallel" keyword is correct.
2) Congratulations, your implementation of a lock-free PRNG looks fine.
3) The problem can come from all the OpenMP pragmas you use in the inner loop. Parallelise at the top level and avoid fine-grained, inner-loop parallelism.
4) In the code below, I used 'nowait' on each 'omp for', moved the OpenMP directive out of the inner loop in the vector_4 processing, and put a barrier at the end to join all the threads and wait for all the work spawned before it to finish.
// pseudo code: body of Simulator::resetEnvironment(), same members as in the question
#pragma omp parallel
{
    #pragma omp for schedule(dynamic) nowait
    for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
        reset(vector_1[i]);
    #pragma omp for schedule(dynamic) nowait
    for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
        reset(vector_2[i]);
    #pragma omp for schedule(dynamic) nowait
    for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
        reset(vector_3[i]);
    // directive moved to the outer loop: each level is handled by one thread
    #pragma omp for schedule(dynamic) nowait
    for (int level = 0; level < level_count; level++)
    {
        for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
            reset(vector_4[level][i]);
    }
    #pragma omp for schedule(dynamic) nowait
    for (long i = 0; i < populationSize; i++) // size ~7M
        resetAgent(agents[i]);
    // wait here until every thread has finished all the loops above
    #pragma omp barrier
} // end #parallel
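On question (b) specifically, a small sketch can be used to check what happens when an 'omp for' sits inside a serial outer loop within a parallel region, as in the original code. Every thread runs the outer loop itself, but the inner iterations are shared among the team, so each element should be visited once per level rather than once per thread. count_visits() and its counters are a hypothetical stand-in for vector_4.

```cpp
#include <cstddef>
#include <vector>

// One visit counter per element, standing in for vector_4[level][i].
std::vector<std::vector<int>> count_visits(int level_count, std::size_t per_level)
{
    std::vector<std::vector<int>> visits(level_count,
                                         std::vector<int>(per_level, 0));
    const long n = static_cast<long>(per_level);
    #pragma omp parallel
    {
        // Executed by every thread; `level` is private to each thread.
        for (int level = 0; level < level_count; level++) {
            // Worksharing: the iterations of this loop are split among threads.
            #pragma omp for schedule(static)
            for (long i = 0; i < n; i++) {
                #pragma omp atomic
                visits[level][i]++;
            }
        }
    }
    return visits;
}
```

If the inner loop were repeated by every thread, the counters would end up equal to the thread count instead of 1.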
Answer 3:
A single-threaded program will run faster than a multi-threaded one if the useful processing time is less than the overhead incurred by the threads.
It is a good idea to determine what the overhead is by implementing a null function, and then decide whether threading is the better solution.
From a performance point of view, threads are only useful if the useful processing time is significantly higher than the overhead incurred by the threads and there are real CPUs available to run them.
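One way to measure that overhead, as a rough sketch: time a parallel region whose body only calls a null function, averaged over many repetitions. null_function() and parallel_region_overhead() are hypothetical names; in a build without OpenMP the pragma is ignored and the result is just the cost of the empty loop.

```cpp
#include <cassert>
#include <chrono>

inline void null_function() {}  // does no useful work on purpose

// Average seconds spent per parallel region that only calls the null function.
double parallel_region_overhead(int repetitions)
{
    assert(repetitions > 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; r++) {
        #pragma omp parallel
        {
            null_function();
        }
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / repetitions;
}
```

Comparing this figure with the serial time of one real loop gives a first idea of whether that loop is worth parallelising at all.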
Source: https://stackoverflow.com/questions/17417444/c-openmp-slower-than-serial-with-default-thread-count