OpenMP

Can't get over 50% max. theoretical performance on matrix multiply

Submitted by 一世执手 on 2019-12-31 10:22:41
Question: I am learning about HPC and code optimization. I am attempting to replicate the results in Goto's seminal matrix multiplication paper (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf). Despite my best efforts, I cannot get above ~50% of the maximum theoretical CPU performance.

Background: See the related question (Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD), which includes details about my hardware.

What I have attempted: This related paper (http://www.cs
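For orientation, the core of Goto's approach is keeping sub-blocks of the operands resident in cache while a hand-tuned micro-kernel streams through them. Below is a minimal sketch of plain loop blocking only, with a made-up tile size BS; it illustrates the cache idea but is far from the paper's packed micro-kernel, which is what closes the remaining gap to peak.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Tile size: a placeholder to be tuned to the target's L1/L2 caches.
    constexpr std::size_t BS = 64;

    // C += A * B for row-major n x n matrices, processed in BS x BS tiles
    // so each tile of A and B stays cache-resident while it is reused.
    void matmul_blocked(const std::vector<double>& A,
                        const std::vector<double>& B,
                        std::vector<double>& C, std::size_t n) {
        for (std::size_t ii = 0; ii < n; ii += BS)
            for (std::size_t kk = 0; kk < n; kk += BS)
                for (std::size_t jj = 0; jj < n; jj += BS)
                    for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                        for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                            const double a = A[i * n + k];
                            for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }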

Specify OpenMP to GCC

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-31 09:16:43
Question: For OpenMP, when my code uses the functions of its API (for example, omp_get_thread_num()) without using its directives (such as #pragma omp ...), why does directly specifying libgomp.a to gcc instead of using -fopenmp not work? For example:

    gcc hello.c /usr/lib/gcc/i686-linux-gnu/4.4/libgomp.a -o hello

Update: I just found that linking to libgomp.a does not work, but linking to libgomp.so works. Does that mean OpenMP cannot be statically linked? Why does -fopenmp only work without specifying the
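A minimal sketch of the scenario, assuming a Linux GCC toolchain (the file name hello.cpp is illustrative). The usual explanation is that -fopenmp does more than name the library: it defines _OPENMP and links libgomp together with its own dependencies (notably pthreads) in the correct order, which a bare libgomp.a on the command line does not.

    // hello.cpp -- uses only the OpenMP runtime API, no directives.
    #include <cstdio>
    #include <omp.h>

    int main() {
        // Outside a parallel region this prints thread 0, but it still
        // needs the libgomp runtime at link time.
        std::printf("hello from thread %d\n", omp_get_thread_num());
        return 0;
    }

Building with g++ -fopenmp hello.cpp -o hello links cleanly; for API-only code like this, linking the shared runtime explicitly (g++ hello.cpp -lgomp -o hello) typically works as well.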

Parallelization for Monte Carlo pi approximation

Submitted by ≯℡__Kan透↙ on 2019-12-31 04:50:47
Question: I am writing a C program to parallelize a pi approximation with OpenMP. I think my code works fine, with a convincing output. I am running it with 4 threads now. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the thread actions in this code? The code looks as follows:

    #include <stdlib.h>
    #include <stdio.h>
    #include <time.h>
    #include <math.h>
    #include <omp.h>

    double sample_interval(double a, double b) {
        double x = ((double) rand())/(
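The code is cut off above, but this pattern has two classic hazards: any shared hit counter must be combined with a reduction rather than incremented directly, and rand() keeps hidden global state, so calling it from several threads is itself a data race. A hedged sketch of a race-free version (all names and the sample count are illustrative), using one generator per thread plus a reduction:

    #include <cstdio>
    #include <random>
    #include <omp.h>

    int main() {
        const long samples = 10000000;   // illustrative sample count
        long hits = 0;                   // combined safely by reduction(+)
        #pragma omp parallel reduction(+:hits)
        {
            // One engine per thread: no shared RNG state, no lock.
            std::mt19937 gen(12345u + omp_get_thread_num());
            std::uniform_real_distribution<double> dist(-1.0, 1.0);
            #pragma omp for
            for (long i = 0; i < samples; ++i) {
                const double x = dist(gen), y = dist(gen);
                if (x * x + y * y <= 1.0) ++hits;
            }
        }
        std::printf("pi ~ %.6f\n", 4.0 * hits / samples);
        return 0;
    }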

Is it much faster to re-initialize a vector using OpenMP threads?

Submitted by 我们两清 on 2019-12-31 04:33:11
Question: I'm using OpenMP for parallel computing. I use C++ vectors whose size is usually on the order of 1e5 elements. During an iterative process I need to re-initialize a bunch of these large vectors (not thread-private, but global in scope) to an initial value. Which is the faster way to do this: #pragma omp for or #pragma omp single?

Answer 1: The general answer would need to be "it depends, you have to measure", since initialization in C++ can be, depending on the type, trivial or
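For concreteness, a sketch of the two variants being compared, assuming already-sized vectors of double (the function names are illustrative):

    #include <algorithm>
    #include <vector>

    // Variant 1: the team splits the writes; on NUMA machines this also
    // keeps pages near the threads that first touched them.
    void reinit_for(std::vector<double>& v, double value) {
        #pragma omp parallel for
        for (long i = 0; i < (long)v.size(); ++i)
            v[i] = value;
    }

    // Variant 2: one thread writes while the rest of the team waits at
    // the implicit barrier at the end of single.
    void reinit_single(std::vector<double>& v, double value) {
        #pragma omp parallel
        {
            #pragma omp single
            std::fill(v.begin(), v.end(), value);
        }
    }

Which one wins has to be measured, as the answer says: filling 1e5 doubles is memory-bound and small enough that thread start-up and barrier costs can dominate the parallel variant.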

find_first of a vector in parallel in C++

Submitted by 时光毁灭记忆、已成空白 on 2019-12-31 04:20:30
Question: I have quite a big vector. Some of the vector's members match a certain condition, and I would like to find the first element that matches it. My problem is very similar to this question (tbb: parallel find first element), but I do not have TBB. Checking the condition is very tedious (so I cannot do it for all elements sequentially). That's why I would like to run it in parallel. I have to mention that I would like to find the first element (so the index position of the
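Without TBB, one hedged OpenMP sketch of the same idea (all names here are made up for illustration): threads scan chunks, share the best index found so far in an atomic, and skip any element that can no longer be the first match.

    #include <atomic>
    #include <climits>
    #include <vector>

    // Returns the lowest index i with pred(v[i]), or -1 if none matches.
    // Assumes pred is expensive and safe to call from several threads.
    template <class T, class Pred>
    long find_first_parallel(const std::vector<T>& v, Pred pred) {
        std::atomic<long> best(LONG_MAX);
        #pragma omp parallel for schedule(dynamic, 64)
        for (long i = 0; i < (long)v.size(); ++i) {
            if (i >= best.load(std::memory_order_relaxed))
                continue;  // a smaller index already matched: skip the work
            if (pred(v[i])) {
                long cur = best.load(std::memory_order_relaxed);
                // Lower the shared minimum; retry if another thread races us.
                while (i < cur && !best.compare_exchange_weak(cur, i)) {}
            }
        }
        const long r = best.load();
        return r == LONG_MAX ? -1 : r;
    }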

OpenMP - Why does the number of comparisons decrease?

Submitted by 你。 on 2019-12-31 04:07:09
Question: I have the following algorithm:

    int hostMatch(long *comparisons) {
        int i = -1;
        int lastI = textLength - patternLength;
        *comparisons = 0;
        #pragma omp parallel for schedule(static, 1) num_threads(1)
        for (int k = 0; k <= lastI; k++) {
            int j;
            for (j = 0; j < patternLength; j++) {
                (*comparisons)++;
                if (textData[k+j] != patternData[j]) {
                    j = patternLength + 1; // break
                }
            }
            if (j == patternLength && k > i)
                i = k;
        }
        return i;
    }

When changing num_threads I get the following results for the number of comparisons:
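The likely culprit is that (*comparisons)++ and the updates to i are unsynchronized writes to shared data, so increments get lost as threads are added and the count drops. A hedged sketch of the usual fix, assuming the same globals as the question and OpenMP 3.1's max reduction:

    // Sketch: accumulate the counter with reduction(+) and the best match
    // index with reduction(max) instead of racing on shared variables.
    int hostMatch(long *comparisons) {
        int best = -1;
        long comps = 0;
        const int lastI = textLength - patternLength;
        #pragma omp parallel for schedule(static, 1) \
                reduction(+:comps) reduction(max:best)
        for (int k = 0; k <= lastI; k++) {
            int j;
            for (j = 0; j < patternLength; j++) {
                comps++;
                if (textData[k+j] != patternData[j])
                    break;  // j < patternLength marks a mismatch
            }
            if (j == patternLength && k > best)
                best = k;   // reduction(max) keeps the highest match, as before
        }
        *comparisons = comps;
        return best;
    }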

Multiplatform multiprocessing?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-31 03:17:37
Question: I was wondering why, in the new C++11, they added threads and not processes. Couldn't they have done a wrapper around platform-specific functions? Any suggestions about the most portable way to do multiprocessing? fork()? OpenMP?

Answer 1: If you can use Qt, the QProcess class could be an elegant platform-independent solution.

Answer 2: If you want to do this portably, I'd suggest you avoid calling fork() directly and instead write your own library function that can be mapped onto a combination of fork()
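For reference, a minimal sketch of the QProcess route from Answer 1, assuming a Qt build environment ("ls" and "-l" are placeholder program and arguments):

    #include <QCoreApplication>
    #include <QDebug>
    #include <QProcess>

    int main(int argc, char *argv[]) {
        QCoreApplication app(argc, argv);
        QProcess proc;
        // QProcess hides the fork()/exec() vs CreateProcess() difference.
        proc.start(QStringLiteral("ls"), {QStringLiteral("-l")});
        proc.waitForFinished();
        qDebug().noquote() << proc.readAllStandardOutput();
        return 0;
    }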

OpenMP parallelisation of pi calculation is either slow or wrong

Submitted by 可紊 on 2019-12-31 00:33:08
Question: I'm having trouble parallelising my Monte Carlo method to calculate pi. Here is the parallelised for-loop:

    #pragma omp parallel for private(i,x,y) schedule(static) reduction(+:count)
    for (i = 0; i < points; i++) {
        x = rand()/(RAND_MAX+1.0)*2 - 1.0;
        y = rand()/(RAND_MAX+1.0)*2 - 1.0;
        // Check if point lies in circle
        if (x*x + y*y < 1.0) {
            count++;
        }
    }

The problem is, it underestimates pi if I use schedule(static), and it's slower than the serial implementation if I use schedule(dynamic). What
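Both symptoms point at rand(): it keeps hidden global state, so concurrent calls are a data race that skews the sample (the underestimate), and on C libraries that protect that state with a lock the threads serialize behind it (the slowdown). A hedged sketch of the common fix, assuming a POSIX system where rand_r is available, with a private seed per thread:

    #include <cstdio>
    #include <cstdlib>
    #include <omp.h>

    int main() {
        const long points = 10000000;  // illustrative
        long count = 0;
        #pragma omp parallel reduction(+:count)
        {
            // Private seed: rand_r touches no global state, unlike rand().
            unsigned seed = 1234u * (omp_get_thread_num() + 1);
            #pragma omp for schedule(static)
            for (long i = 0; i < points; i++) {
                const double x = rand_r(&seed)/(RAND_MAX+1.0)*2 - 1.0;
                const double y = rand_r(&seed)/(RAND_MAX+1.0)*2 - 1.0;
                if (x*x + y*y < 1.0) count++;
            }
        }
        std::printf("pi ~ %f\n", 4.0 * count / points);
        return 0;
    }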
