openmp

OpenMP: Huge performance differences between Visual C++ 2008 and 2010

Submitted by 为君一笑 on 2019-12-03 18:38:10
Question: I'm running a camera acquisition program that performs processing on acquired images, using simple OpenMP directives for this processing. Basically I wait for an image from the camera, then process it. After migrating to VC2010 I see a very strange performance problem: under VC2010 my app takes nearly 100% CPU, while under VC2008 it takes only 10%. If I benchmark only the processing code I get no difference between VC2010 and VC2008; the difference occurs when using the …

Parallelization of PNG file creation with C++, libpng and OpenMP

Submitted by 落爺英雄遲暮 on 2019-12-03 17:47:28
Question: I am currently trying to implement a PNG encoder in C++ based on libpng that uses OpenMP to speed up the compression process. The tool is already able to generate PNG files from various image formats. I uploaded the complete source code to pastebin.com so you can see what I have done so far: http://pastebin.com/8wiFzcgV So far, so good! Now, my problem is to find a way to parallelize the generation of the IDAT chunks containing the compressed image data. Usually, the libpng function png…

Why does the OpenMP SIMD directive reduce performance?

Submitted by 南楼画角 on 2019-12-03 17:22:31
I am learning how to use SIMD directives with OpenMP/Fortran. I wrote the simple code:

```fortran
program loop
  implicit none
  integer :: i,j
  real*8 :: x
  x = 0.0
  do i=1,10000
    do j=1,10000000
      x = x + 1.0/(1.0*i)
    enddo
  enddo
  print*, x
end program loop
```

when I compile this code and run it I get:

```
$ ifort -O3 -vec-report3 -xhost loop_simd.f90
loop_simd.f90(10): (col. 12) remark: LOOP WAS VECTORIZED
loop_simd.f90(9): (col. 7) remark: loop was not vectorized: not inner loop
$ time ./a.out
   97876060.8355515
real    0m8.940s
user    0m8.937s
sys     0m0.005s
```

I did what the compiler suggested about the "not inner loop" and added a …

OpenMP Programming Exercises

Submitted by 不想你离开。 on 2019-12-03 15:37:07
Train ticket selling:

```cpp
// OpenMP2.cpp : defines the entry point of the console application.
#include "stdio.h"
#include "omp.h"
#include <windows.h> // needed for the Sleep() function

int num;
omp_lock_t lock;

int getnum() {
    int temp = num;
    //omp_set_nest_lock(&lock);
#pragma omp atomic
    num--;
    //omp_unset_nest_lock(&lock);
    return num + 1;
}

void chushou(int i) {
    int s = getnum();
    while (s >= 0) {
        omp_set_lock(&lock);
        printf("Window %d sold ticket %d\n", i, s); // originally: 站点%d卖掉了第%d张票
        s = getnum();
        omp_unset_lock(&lock);
        Sleep(500);
    }
}

int main() {
    num = 100;
    int myid;
    omp_init_lock(&lock);
#pragma omp parallel private(myid) num_threads(4)
    {
        myid = omp_get_thread_num();
        //printf("my id is:%d\n" …
```

Terrible performance - a simple issue of overhead, or is there a program flaw?

Submitted by ↘锁芯ラ on 2019-12-03 15:28:55
I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post(). The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the overhead of thread creation is drastically not worth it for a loop of chunk size 4. p ranges from 17 to 1000, depending on the size of the molecule we're simulating. My numbers are for the …

Set number of threads using omp_set_num_threads() to 2, but omp_get_num_threads() returns 1

Submitted by 爱⌒轻易说出口 on 2019-12-03 15:03:41
Question: I have the following C/C++ code using OpenMP:

```c
int nProcessors = omp_get_max_threads();
if (argv[4] != NULL) {
    printf("argv[4]: %s\n", argv[4]);
    nProcessors = atoi(argv[4]);
    printf("nProcessors: %d\n", nProcessors);
}
omp_set_num_threads(nProcessors);
printf("omp_get_num_threads(): %d\n", omp_get_num_threads());
exit(0);
```

As you can see, I'm trying to set the number of processors to use based on an argument passed on the command line. However, I'm getting the following output:

```
argv[4]: 2 //OK
nProcessors: …
```

Why does a while loop in an OMP parallel section fail to terminate when the termination condition depends on an update from a different section?

Submitted by 我怕爱的太早我们不能终老 on 2019-12-03 14:29:43
Is the C++ code below legal, or is there a problem with my compiler? The code was compiled into a shared library using gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC) with OpenMP and then called via R 2.15.2.

```cpp
int it = 0;
#pragma omp parallel sections shared(it)
{
    #pragma omp section
    {
        std::cout << "Entering section A" << std::endl;
        for (it = 0; it < 10; it++) {
            std::cout << "Iteration " << it << std::endl;
        }
        std::cout << "Leaving section A with it=" << it << std::endl;
    }
    #pragma omp section
    {
        std::cout << "Entering section B with it=" << it << std::endl;
        while (it < 10) { 1; }
        std::cout << "Leaving section B" << std::endl;
    }
}
```

I …

OpenMP drastic slowdown for specific thread number

Submitted by 断了今生、忘了曾经 on 2019-12-03 14:23:06
I ran an OpenMP program to perform the Jacobi method, and it was working very well: 2 threads performed slightly over 2x 1 thread, and 4 threads 2x faster than 1 thread. I felt everything was working perfectly... until I reached exactly 20, 22, and 24 threads. I kept breaking it down until I had this simple program:

```c
#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, n, maxiter, threads, nsquared, execs = 0;
    double begin, end;
    if (argc != 4) {
        printf("4 args\n");
        return 1;
    }
    n = atoi(argv[1]);
    threads = atoi(argv[2]);
    maxiter = atoi(argv[3]);
    omp_set_num_threads…
```

OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

Submitted by 蓝咒 on 2019-12-03 14:10:50
Question: I'm writing a program for matrix multiplication with OpenMP that, for better cache efficiency, implements the multiplication A x B(transpose) rows-by-rows instead of the classic A x B rows-by-columns. Doing this I came across an interesting fact that seems illogical to me: if in this code I parallelize the outer loop, the program is slower than if I put the OpenMP directives on the innermost loop; on my computer the times are 10.9 vs 8.1 seconds. //A and B are double* allocated …

Fetch-and-add using OpenMP atomic operations

Submitted by 谁都会走 on 2019-12-03 13:44:25
I'm using OpenMP and need to use the fetch-and-add operation. However, OpenMP doesn't provide an appropriate directive/call. I'd like to preserve maximum portability, hence I don't want to rely on compiler intrinsics. Rather, I'm searching for a way to harness OpenMP's atomic operations to implement this, but I've hit a dead end. Can this even be done? N.B., the following code almost does what I want:

```c
#pragma omp atomic
x += a
```

Almost, but not quite, since I really need the old value of x. fetch_and_add should be defined to produce the same result as the following (only non-locking): template …