No speed-up with useless printf's using OpenMP

Question

I just wrote my first OpenMP program, which parallelizes a simple for loop. I ran the code on my dual-core machine and saw some speed-up when going from 1 thread to 2 threads. However, I ran the same code on a school Linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements gave the code a significant speed-up. Below is the main part of the code that I parallelized:

#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
{
  printf("useless statement");
  prime[i-2] = is_prime(i);
}

I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?

Answer 1:

Speculating, but maybe stdout is guarded by a lock?

In general, printf is an expensive operation because it interacts with other resources (such as files, the console and such).

My empirical experience is that printf is very slow on a Windows console, comparatively much faster on a Linux console, and fastest still if redirected to a file or /dev/null.

I've found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.

Try running your application with its output redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lies.

Of course, if the printfs are useless, why are they in the loop at all?

Answer 2:

To expand a bit on @Will's answer ...

I don't know whether stdout is guarded by a lock, but I'm pretty sure that writing to it is serialised at some point in the software stack. With the printf statements included, the OP is probably timing the execution of a lot of serial writes to stdout, not the parallelised execution of the loop.

I suggest the OP modify the printf statement to include i and see what happens.

As for the apparent speed-up on the dual-core machine: was it statistically significant?
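One way to check this is to time the loop with and without the output. The following is only a sketch: is_prime here is a hypothetical trial-division stand-in for the OP's function, and fill_primes and wall_seconds are names made up for illustration. It uses wall-clock time, since clock() would add up CPU time across threads.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the question's is_prime(): plain trial division. */
int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Wall-clock seconds via C11 timespec_get. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Fills prime[i-2] for i = 2..n and returns the elapsed wall-clock seconds.
   When with_printf is nonzero, every iteration also writes a line to out,
   so the threads contend for the same stream. */
double fill_primes(int *prime, int n, int with_printf, FILE *out)
{
    assert(n >= 2);
    double t0 = wall_seconds();
    #pragma omp parallel for
    for (int i = 2; i <= n; i++) {
        if (with_printf)
            fprintf(out, "checking %d\n", i);
        prime[i - 2] = is_prime(i);
    }
    return wall_seconds() - t0;
}
```

Calling fill_primes once with with_printf set to 0 and once with 1, for 1 and 2 threads (for example via OMP_NUM_THREADS), should show how much of the measured time is output rather than computation. Repeating each run a few times gives a rough idea of the noise.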

Answer 3:

You have here a parallel for loop, but the scheduling is unspecified.

#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)

Several scheduling types are defined in the OpenMP 3.0 standard. For loops declared with schedule(runtime), the type can be selected by setting the OMP_SCHEDULE environment variable to type[,chunk], where

type is one of static, dynamic, guided, or auto
chunk is an optional positive integer that specifies the chunk size

Another way to change the schedule kind is to call the OpenMP function omp_set_schedule.
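As a rough sketch of the second option: the loop below uses schedule(runtime), so the kind and chunk size come from omp_set_schedule (or from OMP_SCHEDULE if the call is left out). The function name fill_primes_runtime_schedule and the trial-division is_prime are assumptions for illustration, not the OP's code.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the question's is_prime(): plain trial division. */
int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Fills prime[i-2] for i = 2..n, with the loop schedule chosen at run time. */
void fill_primes_runtime_schedule(int *prime, int n, int chunk)
{
    assert(n >= 2 && chunk > 0);
#ifdef _OPENMP
    /* Request dynamic scheduling with the given chunk size; this replaces
       whatever OMP_SCHEDULE had set for schedule(runtime) loops. */
    omp_set_schedule(omp_sched_dynamic, chunk);
#endif
    #pragma omp parallel for schedule(runtime)
    for (int i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);
}
```

Without the schedule(runtime) clause, neither OMP_SCHEDULE nor omp_set_schedule affects the loop, and the implementation's default schedule is used. The _OPENMP guard only keeps the sketch compilable when OpenMP is switched off.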

I assume the is_prime function is rather fast:

  prime[i-2] = is_prime(i);

So the problem may come from a poorly chosen scheduling mode, in which only a small amount of work is executed between the synchronization points that scheduling introduces.

printf also has two parts inside it (I am taking glibc as the reference, since it is the popular Linux libc implementation):

Parse the format string and put all the parameters into a buffer
Write the buffer to the file descriptor (to the FILE buffer, as stdout is buffered by glibc by default)

The first part of printf can be done in parallel, but the second part is a critical section and is locked with _IO_flockfile.
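If the output is actually needed, the two parts can be separated by hand. The sketch below is one possible arrangement, not what glibc does internally: each iteration formats its line with snprintf into its own slot of a buffer (no stream lock involved), and a single thread writes the slots out afterwards. report_primes, LINE_LEN and the trial-division is_prime are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum { LINE_LEN = 32 };  /* enough for "<int> is composite\n" plus the terminator */

/* Hypothetical stand-in for the question's is_prime(): plain trial division. */
int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Formats one line per i = 2..n in parallel, then writes them serially.
   Returns the number of lines written, or -1 if allocation fails. */
int report_primes(int n, FILE *out)
{
    assert(n >= 2);
    size_t count = (size_t)n - 1;
    char *lines = malloc(count * LINE_LEN);
    if (lines == NULL)
        return -1;

    /* Parallel part: each iteration touches only its own slot. */
    #pragma omp parallel for
    for (int i = 2; i <= n; i++)
        snprintf(lines + (size_t)(i - 2) * LINE_LEN, LINE_LEN,
                 "%d is %s\n", i, is_prime(i) ? "prime" : "composite");

    /* Serial part: one thread writes, so the stream lock is never contended. */
    for (size_t k = 0; k < count; k++)
        fputs(lines + k * LINE_LEN, out);

    free(lines);
    return (int)count;
}
```

This trades memory for less contention and also keeps the lines in order. For large n a per-thread buffer flushed in chunks would probably be a better fit.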

Answer 4:

What were your timings - was it much slower with the printf's? In some tight loops the printf's can take a large fraction of the total computing time, for example if is_prime() is fast enough that the performance is determined more by the number of calls to printf than by the number of (parallelized) calls to is_prime().

Source: https://stackoverflow.com/questions/2711456/no-speed-up-with-useless-printfs-using-openmp

Tags

multithreading

printf

parallel-processing

openmp

performance