openmp

How can I set the number of OpenMP threads from within the program?

断了今生、忘了曾经 submitted on 2019-12-07 06:28:53
Question: Running the program as $ OMP_NUM_THREADS=4 ./a.out limits the number of active OpenMP threads to 4, as evidenced by htop. However, if instead of binding the OMP_NUM_THREADS environment variable in Bash, I call setenv("OMP_NUM_THREADS", "4", 1); from main before calling any OpenMP-enabled functions, this seems to have no effect. Why is this happening? How can I set the number of OpenMP threads from within the program, if it's possible at all?

Answer 1: There are two ways one can use to set the…
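The answer above is cut off, but the two ways it introduces are almost certainly the standard ones: the omp_set_num_threads() runtime call and the num_threads() clause. Both work regardless of the environment, because most runtimes read OMP_NUM_THREADS only once at startup, which is why the setenv call from main comes too late. A minimal sketch:

    #include <omp.h>
    #include <cstdio>

    int main() {
        omp_set_num_threads(4);              // way 1: runtime call, affects subsequent regions

        #pragma omp parallel
        {
            #pragma omp single
            std::printf("team size: %d\n", omp_get_num_threads());
        }

        #pragma omp parallel num_threads(2)  // way 2: per-region clause, overrides the call
        {
            #pragma omp single
            std::printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }

Compiled with g++ -fopenmp, this prints 4 for the first region and 2 for the second.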

incomprehensible performance improvement with openmp even when num_threads(1)

不想你离开。 submitted on 2019-12-07 06:22:41
Question: The following lines of code

    int nrows = 4096;
    int ncols = 4096;
    size_t numel = nrows * ncols;
    unsigned char *buff = (unsigned char *) malloc(numel);
    unsigned char *pbuff = buff;

    #pragma omp parallel for schedule(static), firstprivate(pbuff, nrows, ncols), num_threads(1)
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }

take 11130 usecs to run on my i5-3230M when compiled with

    g++ -o main main.cpp -std=c++0x -O3

That is, when the OpenMP pragmas are ignored…
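No answer survives the truncation, but the first step in chasing an effect like this is a harness that wall-clock-times the identical loop in both builds. A sketch under that assumption (the increment_all wrapper is my own factoring of the question's loop):

    #include <omp.h>
    #include <cstdio>
    #include <cstdlib>

    // The question's loop, factored out so both builds time the same code.
    static void increment_all(unsigned char *buff, int nrows, int ncols) {
        unsigned char *pbuff = buff;
        #pragma omp parallel for schedule(static) firstprivate(pbuff, nrows, ncols) num_threads(1)
        for (int i = 0; i < nrows; i++)
            for (int j = 0; j < ncols; j++)
                *pbuff++ += 1;
    }

    int main() {
        const int nrows = 4096, ncols = 4096;
        unsigned char *buff = (unsigned char *) std::calloc((size_t) nrows * ncols, 1);

        double t0 = omp_get_wtime();   // wall clock, immune to the clock() trap
        increment_all(buff, nrows, ncols);
        double t1 = omp_get_wtime();
        std::printf("%.0f usecs (checksum %d)\n", (t1 - t0) * 1e6, buff[0]);

        std::free(buff);
        return 0;
    }

Comparing a build with -fopenmp against one without (substituting any wall-clock source for omp_get_wtime in the latter) shows whether the pragma itself changes the generated code, for example by altering the compiler's aliasing assumptions when the loop body is outlined.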

OpenMP: don't use hyperthreading cores (half `num_threads()` w/ hyperthreading)

[亡魂溺海] submitted on 2019-12-07 06:15:40
Question: In "Is OpenMP (parallel for) in g++ 4.7 not very efficient? 2.5x at 5x CPU", I determined that the performance of my programme varies between 11s and 13s (mostly always above 12s, and sometimes as slow as 13.4s) at around 500% CPU when using the default #pragma omp parallel for, and that the OpenMP speed-up is only 2.5x at 5x CPU with g++-4.7 -O3 -fopenmp, on a 4-core, 8-thread Xeon. I tried using schedule(static) num_threads(4) and noticed that my programme always completes in 11.5s to 11.7s…
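For reference, the usual way to keep the team off the hyperthread siblings, beyond hard-coding num_threads(4), is OpenMP 4.0 affinity. A sketch, assuming a 4-core/8-thread machine and a runtime that supports it:

    // Run as:  OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=4 ./a.out
    // or request the binding in source, as below.
    #include <omp.h>
    #include <cstdio>

    int main() {
        omp_set_num_threads(4);             // one thread per physical core
        double sum = 0.0;
        #pragma omp parallel for schedule(static) proc_bind(spread) reduction(+:sum)
        for (int i = 0; i < 100000000; i++)
            sum += i * 1e-9;
        std::printf("%f\n", sum);
        return 0;
    }

With OMP_PLACES=cores, each place is one physical core, so the four bound threads can never land on two hyperthreads of the same core.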

OpenMP and GSL RNG - Performance Issue - 4 threads implementation 10x slower than pure sequential one (quadcore CPU)

*爱你&永不变心* submitted on 2019-12-07 05:04:20
Question: I am trying to turn a C project of mine from sequential into parallel programming. Although most of the code has now been redesigned from scratch for this purpose, the generation of random numbers is still at its core. Thus, bad performance of the random number generator (RNG) affects the overall performance of the program very badly. I wrote some lines of code (see below) to show the problem I am facing without much verbosity. The problem is the following: every time the number of threads nt…
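The code is cut off, but the pattern that usually cures this particular slowdown is one generator per thread, so no gsl_rng state is ever shared or contended. A sketch (the seeding scheme and iteration counts are mine):

    #include <gsl/gsl_rng.h>
    #include <omp.h>
    #include <cstdio>

    int main() {
        int nt = omp_get_max_threads();
        gsl_rng **rngs = new gsl_rng *[nt];
        for (int t = 0; t < nt; t++) {
            rngs[t] = gsl_rng_alloc(gsl_rng_mt19937);
            gsl_rng_set(rngs[t], 1234 + t);          // distinct seed per thread
        }

        double sum = 0.0;
        #pragma omp parallel reduction(+:sum)
        {
            gsl_rng *r = rngs[omp_get_thread_num()]; // this thread's private generator
            #pragma omp for
            for (int i = 0; i < 10000000; i++)
                sum += gsl_rng_uniform(r);
        }
        std::printf("%f\n", sum);

        for (int t = 0; t < nt; t++) gsl_rng_free(rngs[t]);
        delete[] rngs;
        return 0;
    }

Each generator's state is separately heap-allocated, which also makes false sharing between the per-thread states unlikely. Build with g++ -fopenmp ... -lgsl -lgslcblas.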

Can I assign multiple threads to a code section in OpenMP?

你离开我真会死。 submitted on 2019-12-07 04:24:54
Question: I'm looking for a way to execute sections of code in parallel, using multiple threads for each section. For example, if I have 16 threads and two tasks, I want 8 threads each to simultaneously execute those two tasks. OpenMP has several constructs (section, task) that execute general code in parallel, but they are single-threaded: in my scenario, using section or task would result in one thread executing each of the two tasks while 14 threads sit idly by. Is something like that even…
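Whether or not this made it into an answer here, nested parallel regions are one common way to get the 2×8 split described above. A sketch, assuming the runtime permits nesting:

    #include <omp.h>
    #include <cstdio>

    int main() {
        omp_set_max_active_levels(2);        // allow an inner team per outer thread
        #pragma omp parallel num_threads(2)  // one outer thread per task
        {
            int task = omp_get_thread_num(); // 0 -> first task, 1 -> second task
            #pragma omp parallel num_threads(8)
            {
                // eight threads cooperate on this task's work here
                std::printf("task %d, worker %d of %d\n",
                            task, omp_get_thread_num(), omp_get_num_threads());
            }
        }
        return 0;
    }

This spawns 16 threads in total; the work inside each inner region can use ordinary worksharing (omp for, sections) scoped to that task's 8-thread team.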

Shared vectors in OpenMP

99封情书 submitted on 2019-12-07 02:51:55
Question: I am trying to parallelize a program I am using, and I ran into the following question: will I get a loss of performance if multiple threads need to read/write on the same vector, but different elements of the vector? I have the feeling that's the reason my program hardly gets any faster upon parallelizing it. Take the following code:

    #include <vector>

    int main() {
        vector<double> numbers;
        vector<double> results(10);
        double x;
        // write 10 values in vector numbers
        for (int i = 0; i < 10; i++) {
            numbers.push_back…
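Reading different elements from several threads is harmless; the performance hazard hinted at here is usually false sharing, when threads write neighbouring elements that live on the same cache line. A sketch of the distinction (the squaring body is a placeholder, since the original computation is cut off):

    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 10;
        std::vector<double> numbers, results(n);
        for (int i = 0; i < n; i++) numbers.push_back(i);  // as in the question

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double local = numbers[i] * numbers[i]; // reads of shared data: always fine
            results[i] = local;                     // one write per element; adjacent
                                                    // writes can still share a cache line
        }
        for (int i = 0; i < n; i++) std::printf("%f\n", results[i]);
        return 0;
    }

With only 10 elements the parallel overhead dwarfs the work in any case, which is a second plausible reason the program "hardly gets any faster".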

How can Microsoft's OpenMP spinlock time be controlled?

ぐ巨炮叔叔 submitted on 2019-12-07 02:28:50
Question: The OpenMP runtime used by the Intel compiler supports an environment variable KMP_BLOCKTIME (docs), which I believe controls the busy-waiting (spinlock) time the threads will spend waiting for new work (the linked document claims this defaults to 200 ms). The OpenMP runtime used by the GNU compiler supports an environment variable GOMP_SPINCOUNT (docs), which I believe also controls that library's equivalent implementation detail (although apparently expressed as an iteration count rather than a time). My…
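The question is cut off before it reaches the Microsoft runtime, and I know of no documented MSVC counterpart to these knobs, so the following only collects the two vendor variables named above plus the portable hint (a sketch; MSVC's OpenMP 2.0 runtime may ignore all of them):

    :: Intel runtime: spin time in ms before sleeping (0 = yield immediately)
    set KMP_BLOCKTIME=0
    :: GNU runtime: the same idea, expressed as a spin-iteration count
    set GOMP_SPINCOUNT=0
    :: Portable hint from OpenMP 3.0 onwards; not part of MSVC's OpenMP 2.0
    set OMP_WAIT_POLICY=PASSIVE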

Why is my computer not showing a speedup when I use parallel code?

陌路散爱 submitted on 2019-12-07 02:15:22
Question: So I realize this question sounds stupid (and yes, I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, they were both using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm…
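Though the question breaks off, one explanation fits these symptoms exactly and is worth ruling out first: on POSIX systems clock() returns CPU time summed across all threads, so perfectly parallel code reports the same clock() figure as the serial version. A sketch contrasting it with wall-clock time:

    #include <omp.h>
    #include <ctime>
    #include <cstdio>

    int main() {
        clock_t c0 = std::clock();     // CPU time, summed over every thread
        double  w0 = omp_get_wtime();  // wall-clock time

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < 200000000L; i++)
            sum += i * 1e-9;

        std::printf("cpu: %.2fs  wall: %.2fs  (sum=%f)\n",
                    (double)(std::clock() - c0) / CLOCKS_PER_SEC,
                    omp_get_wtime() - w0, sum);
        return 0;
    }

On a dual core that actually parallelises, the wall time comes out at roughly half the CPU time; if the two match, the loop really is running serially.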

Parallel Merge-Sort in OpenMP

落花浮王杯 submitted on 2019-12-07 02:10:36
Question: I have seen an algorithm for parallel merge sort in this paper. This is the code:

    void mergesort_parallel_omp(int a[], int size, int temp[], int threads)
    {
        if (threads == 1) {
            mergesort_serial(a, size, temp);
        } else if (threads > 1) {
            #pragma omp parallel sections
            {
                #pragma omp section
                mergesort_parallel_omp(a, size/2, temp, threads/2);
                #pragma omp section
                mergesort_parallel_omp(a + size/2, size - size/2,
                                       temp + size/2, threads - threads/2);
            }
            merge(a, size, temp);
        } // threads > 1
    }

I…
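For context, a driver for the paper's routine typically looks like the sketch below (my scaffolding; it assumes the routine above, together with the paper's mergesort_serial and merge, is linked in). The recursive sections only fan out if the runtime allows parallel regions to nest:

    #include <omp.h>
    #include <cstdlib>

    void mergesort_parallel_omp(int a[], int size, int temp[], int threads); // from the question

    int main() {
        const int n = 1 << 20;
        int *a    = (int *) std::malloc(n * sizeof(int));
        int *temp = (int *) std::malloc(n * sizeof(int));
        for (int i = 0; i < n; i++) a[i] = std::rand();

        omp_set_max_active_levels(8);  // let the recursive teams nest
        mergesort_parallel_omp(a, n, temp, omp_get_max_threads());

        std::free(a);
        std::free(temp);
        return 0;
    }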

How to parallelize correctly a nested for loops

喜欢而已 submitted on 2019-12-07 00:02:32
Question: I'm working with OpenMP to parallelize a scalar nested for loop:

    double P[N][N];
    double x = 0.0, y = 0.0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            P[i][j] = someLongFunction(x, y);
            y += 1;
        }
        x += 1;
    }

The important thing in this loop is that the matrix P must be the same in both the scalar and the parallel versions. All my possible attempts didn't succeed…

Answer 1: The problem here is that you have added iteration-to-iteration dependencies with x += 1; and y += 1;. Therefore, as the code stands right now, it is not…
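The answer breaks off, but the reasoning it starts completes naturally: x and y are pure functions of the loop indices (x == i, and y == i*N + j, because y is never reset between rows), so replacing the running counters with those expressions makes every iteration independent. A sketch with a stand-in body for someLongFunction:

    #include <cstdio>

    const int N = 64;
    double P[N][N];

    double someLongFunction(double x, double y) { return x + y; } // placeholder body

    int main() {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                P[i][j] = someLongFunction((double) i, (double) i * N + j);

        std::printf("%f\n", P[N-1][N-1]);
        return 0;
    }

Because each P[i][j] now depends only on i and j, the parallel result matches the scalar one exactly, and collapse(2) lets OpenMP distribute both loop levels.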