Question
I just wrote my first OpenMP program, which parallelizes a simple for loop. I ran the code on my dual-core machine and saw some speed-up when going from 1 thread to 2 threads. However, I ran the same code on a school Linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements gave the code a significant speed-up. Below is the main part of the code that I parallelized:
#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
{
    printf("useless statement");
    prime[i-2] = is_prime(i);
}
I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?
Answer 1:
Speculating, but maybe stdout is guarded by a lock?
In general, printf is an expensive operation because it interacts with other resources (such as files, the console and such).
In my experience, printf is very slow on a Windows console, comparatively much faster on a Linux console, and fastest when redirected to a file or /dev/null.
I've found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.
Try running your application with its output redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lies.
Of course, if the printfs are useless, why are they in the loop at all?
Answer 2:
To expand a bit on @Will's answer ...
I don't know whether stdout is guarded by a lock, but I'm pretty sure that writing to it is serialised at some point in the software stack. With the printf statements included, OP is probably timing the execution of a lot of serial writes to stdout, not the parallelised execution of the loop.
I suggest OP modify the printf statement to include i and see what happens.
As for the apparent speed-up on the dual-core machine: was it statistically significant?
Answer 3:
You have here a parallel for loop, but the scheduling is unspecified.
#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
The OpenMP 3.0 standard defines several scheduling types. They can be selected by setting the OMP_SCHEDULE environment variable to type[,chunk], where
- type is one of static, dynamic, guided, or auto
- chunk is an optional positive integer that specifies the chunk size
Another way to change the schedule kind is to call the OpenMP function omp_set_schedule. Both settings take effect only for loops declared with schedule(runtime).
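A minimal sketch of picking the schedule at run time. The asker's is_prime is not shown, so a trial-division version stands in for it here; mark_primes, the chunk size of 64 and the _OPENMP guards (which let the file build without OpenMP too) are assumptions made for illustration:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the question's is_prime(): trial division. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Fill prime[0..n-2] for the integers 2..n; return how many are prime. */
int mark_primes(int *prime, int n)
{
    int i, count = 0;

#ifdef _OPENMP
    /* Ask for dynamic scheduling in chunks of 64 iterations.
       The chunk size is an arbitrary choice for this sketch. */
    omp_set_schedule(omp_sched_dynamic, 64);
#endif

    /* schedule(runtime) uses the run-time schedule setting, which
       OMP_SCHEDULE initialises and omp_set_schedule overrides. */
    #pragma omp parallel for schedule(runtime) reduction(+:count)
    for (i = 2; i <= n; i++) {
        prime[i - 2] = is_prime(i);
        count += prime[i - 2];
    }
    return count;
}
```

Dropping the omp_set_schedule call and running with, for example, OMP_SCHEDULE="guided,16" in the environment should let the same binary try other schedules without recompiling.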
The is_prime function is probably rather fast (this is my assumption):
prime[i-2] = is_prime(i);
So the problem may come from an unsuitable scheduling mode, where each thread executes only a small amount of work before it has to synchronize with the scheduler.
The printf call itself has two parts (I am assuming glibc, the popular Linux libc implementation):
- Parse the format string and put all the parameters into a buffer
- Write the buffer to the file descriptor (in practice to the FILE buffer, as stdout is buffered by glibc by default)
The first part of printf can be done in parallel, but the second part is a critical section and is locked with _IO_flockfile.
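One way to keep that lock out of the parallel region is to do no stdio inside the loop and report afterwards from a single thread. This is only a sketch: mark_and_report is a hypothetical helper, and is_prime is again a trial-division stand-in for the asker's function.

```c
#include <stdio.h>

/* Hypothetical stand-in for the question's is_prime(): trial division. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Compute prime[0..n-2] in parallel, then print the primes serially.
   Returns the number of bytes written to `out`. */
long mark_and_report(int *prime, int n, FILE *out)
{
    int i;
    long written = 0;

    /* No shared stream is touched here, so threads never contend
       for the stdio lock. */
    #pragma omp parallel for
    for (i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);

    /* All output happens after the parallel loop, on one thread. */
    for (i = 2; i <= n; i++) {
        if (prime[i - 2]) {
            int r = fprintf(out, "%d\n", i);
            if (r > 0)
                written += r;
        }
    }
    return written;
}
```

Calling mark_and_report(prime, n, stdout) keeps the output but moves it out of the timed parallel work.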
Answer 4:
What were your timings? Was it much slower with the printfs? In some tight loops the printfs might take a large fraction of the total computing time. For example, if is_prime() is pretty fast, the performance is determined more by the number of calls to printf than by the number of (parallelized) calls to is_prime().
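To get such timings, the loop can be wrapped in a small measuring function and run once with and once without the print. The sketch below is an assumption-laden illustration: time_loop and now_seconds are hypothetical helpers, is_prime is a trial-division stand-in, and without OpenMP the timer falls back to CPU time from clock().

```c
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds under OpenMP; CPU seconds otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Hypothetical stand-in for the question's is_prime(): trial division. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Run the question's loop and return the elapsed seconds.
   If `out` is non-NULL, every iteration also prints to it. */
double time_loop(int *prime, int n, FILE *out)
{
    int i;
    double start = now_seconds();

    #pragma omp parallel for
    for (i = 2; i <= n; i++) {
        if (out != NULL)
            fprintf(out, "useless statement");
        prime[i - 2] = is_prime(i);
    }
    return now_seconds() - start;
}
```

Comparing time_loop(prime, n, NULL) with time_loop(prime, n, stdout) for 1 and 2 threads (for example via OMP_NUM_THREADS) should show how much of the run time goes to the prints.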
Source: https://stackoverflow.com/questions/2711456/no-speed-up-with-useless-printfs-using-openmp