reduction

Reduction of matrix rows in OpenCL

半腔热情 submitted on 2019-12-11 06:28:55

Question: I have a matrix stored as a 1D array on the GPU, and I am trying to write an OpenCL kernel that applies a reduction to every row of this matrix. For example, if my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6], what I want is:

    [1, 2, 3] = [ 6]
    [4, 5, 6]   [15]

Obviously, since I am talking about reduction, the actual result could be more than one element per row:

    [1, 2, 3] = [3, 3]
    [4, 5, 6]   [9, 6]

Then I can do the final calculation in another kernel or on the CPU.
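For reference, the per-row reduction the question asks for can be sketched on the host in plain C++ (not actual OpenCL; the function name is illustrative). Each OpenCL work-group would typically compute one row's sum this way:

```cpp
#include <cassert>
#include <vector>

// Reduce each row of a row-major rows x cols matrix stored as a 1D array.
// This is the host-side reference result that a per-row OpenCL
// work-group reduction should reproduce.
std::vector<int> reduceRows(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> out(rows, 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            out[r] += m[r * cols + c];   // row r occupies m[r*cols .. r*cols+cols-1]
    return out;
}
```

On the device, a natural mapping is one work-group per row with a local-memory tree reduction inside the group; the partial-sums variant in the question simply stops that tree early.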

Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA

点点圈 submitted on 2019-12-11 04:25:34

Question: I am looking for a fast way to reduce multiple blocks of equal length that are arranged as one big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size K, so the size of the whole array is N*K. What I am doing is to call the kernel N times; each time it computes the reduction of one subarray, iterating over all the subarrays contained in the big vector:

    for (i = 0; i < N; i++) {
        thrust::device_vector< float > Vec
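Rather than N separate launches, the whole batch can be reduced in a single pass (in Thrust this is what `reduce_by_key` with segment keys `i / K` does). A host-side sketch of the segmented result, with illustrative names:

```cpp
#include <vector>

// One pass over the big vector produces one sum per length-K segment --
// the same result a single batched launch (or thrust::reduce_by_key with
// key i / K) would give, instead of N kernel calls.
std::vector<float> reduceSegments(const std::vector<float>& big, int N, int K) {
    std::vector<float> out(N, 0.0f);
    for (int s = 0; s < N; ++s)          // segment index
        for (int j = 0; j < K; ++j)      // element within the segment
            out[s] += big[s * K + j];
    return out;
}
```

On the GPU the usual layout is one block per segment, so all N reductions run concurrently in one launch.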

Haskell: Alternative, non-circular definition of Redex?

江枫思渺然 submitted on 2019-12-10 22:35:59

Question: I got quite confused about what is and is not a redex in Haskell, so I spent some time on it, but I would like feedback on whether I got it right. I found this definition of a redex, and it is circular:

    Etymology: From "reducible expression"
    Definition: Redex (plural redexes): (mathematics) Something to be reduced
    according to the rules of a formal system.

http://en.wiktionary.org/wiki/redex

The above definition presumes one already knows how to reduce. So to me this is like saying "Bluish is the

Struggling with intuition regarding how warp-synchronous thread execution works

别说谁变了你拦得住时间么 submitted on 2019-12-10 20:48:22

Question: I am new to CUDA. I am working on basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:

    __global__ void Reduction2_kernel( int *out, const int *in, size_t N )
    {
        extern __shared__ int sPartials[];
        int sum = 0;
        const int tid = threadIdx.x;
        for ( size_t i = blockIdx.x*blockDim.x + tid; i < N; i += blockDim.x*gridDim.x ) {
            sum += in[i];
        }
        sPartials[tid] = sum;
        __syncthreads();
        for ( int activeThreads = blockDim.x>>1;
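The loop that is cut off above is the standard shared-memory tree reduction: at each step the first `activeThreads` threads each add in the partial of their partner `activeThreads` positions away, with a `__syncthreads()` between steps. A host-side C++ sketch of that halving loop (assuming, as the kernel does, a power-of-two block size):

```cpp
#include <vector>

// Mimics the shared-memory tree reduction over sPartials: at each step
// the first `active` slots accumulate their partner `active` positions
// away -- on the GPU, each inner iteration is one thread, and a
// __syncthreads() separates the outer steps.
int treeReduce(std::vector<int> partials) {   // size must be a power of two
    for (std::size_t active = partials.size() / 2; active > 0; active /= 2)
        for (std::size_t tid = 0; tid < active; ++tid)
            partials[tid] += partials[tid + active];
    return partials[0];   // thread 0 ends up holding the block's sum
}
```

After the loop, the kernel would have thread 0 write `sPartials[0]` to `out[blockIdx.x]`.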

omp max reduction with storage of index

自闭症网瘾萝莉.ら submitted on 2019-12-10 19:45:36

Question: Using C++ and OpenMP 3.1, I implemented a max reduction which stores the maximum value of an integer variable (score) over a vector of objects (s). But I also want to store the vector index, to access the s object with the maximum score. My current unsuccessful implementation looks like this:

    // s is a vector of sol objects which contain, apart from other
    // variables, an integer score variable s[i].score
    int bestscore = 0;
    int bestant = 0;
    #pragma omp parallel shared(bestant)
    { // start parallel session
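OpenMP 3.1's `reduction(max: ...)` cannot carry the index along with the value. A common workaround is a thread-local best plus one critical-section merge per thread. A sketch under that pattern (names are illustrative; with OpenMP disabled the pragmas are ignored and the loop runs serially with the same result):

```cpp
#include <vector>

struct Sol { int score; };   // stand-in for the question's sol objects

// Returns the index of the highest-scoring element. Each thread tracks a
// local best over its chunk; the shared best is updated only once per
// thread, inside the critical section, so contention stays low.
int argmaxScore(const std::vector<Sol>& s) {
    int bestscore = -1, bestant = -1;
    #pragma omp parallel
    {
        int localScore = -1, localAnt = -1;
        #pragma omp for nowait
        for (int i = 0; i < (int)s.size(); ++i)
            if (s[i].score > localScore) { localScore = s[i].score; localAnt = i; }
        #pragma omp critical
        if (localScore > bestscore) { bestscore = localScore; bestant = localAnt; }
    }
    return bestant;
}
```

Note that when several elements tie for the maximum, which index wins depends on thread merge order.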

Flow Shop to Boolean satisfiability [Polynomial-time reduction]

て烟熏妆下的殇ゞ submitted on 2019-12-10 13:43:54

Question: I am contacting you to get an idea of how to transform a flow shop scheduling problem into Boolean satisfiability. I have already done such reductions for an N*N Sudoku, the N-queens problem, and a class scheduling problem, but I have some issues with how to transform the flow shop into SAT. A SAT problem looks like this: the goal is, given a set of Boolean variables, to find an assignment of every variable that makes the "sentence" true (if finding a solution is possible). I created my own solver
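One standard approach (analogous to the Sudoku encoding the question mentions) is a time-indexed encoding: a Boolean variable x(j, m, t) meaning "job j starts on machine m at time t" over a horizon of T slots, with "exactly one start time" clauses per (job, machine). A sketch of that clause generation in DIMACS-style integer literals; the variable layout and names are my own, and a full reduction would also need machine-conflict and job-precedence clauses:

```cpp
#include <vector>

// Time-indexed encoding sketch: var(j, m, t) is true iff job j starts on
// machine m at time slot t. Emits, per (job, machine): one "at least one
// start time" clause and pairwise "at most one start time" clauses.
struct Cnf {
    int J, M, T;
    std::vector<std::vector<int>> clauses;   // each clause: DIMACS literals
    int var(int j, int m, int t) const { return 1 + (j * M + m) * T + t; }
};

Cnf encodeStartTimes(int J, int M, int T) {
    Cnf f{J, M, T, {}};
    for (int j = 0; j < J; ++j)
        for (int m = 0; m < M; ++m) {
            std::vector<int> atLeastOne;
            for (int t = 0; t < T; ++t)
                atLeastOne.push_back(f.var(j, m, t));
            f.clauses.push_back(atLeastOne);        // some start time exists
            for (int t1 = 0; t1 < T; ++t1)          // no two start times
                for (int t2 = t1 + 1; t2 < T; ++t2)
                    f.clauses.push_back({-f.var(j, m, t1), -f.var(j, m, t2)});
        }
    return f;
}
```

Each (job, machine) pair contributes 1 + T(T-1)/2 clauses, so the encoding grows quadratically in the horizon; to decide optimality one typically solves repeatedly while shrinking T.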

How to implement argmax with OpenMP?

南笙酒味 submitted on 2019-12-10 11:44:16

Question: I am trying to implement an argmax with OpenMP. In short, I have a function that computes a floating point value:

    double toOptimize(int val);

I can get the integer maximizing the value with:

    double best = 0;
    #pragma omp parallel for reduction(max: best)
    for(int i = 2; i < MAX; ++i) {
        double v = toOptimize(i);
        if(v > best) best = v;
    }

Now, how can I get the value i corresponding to the maximum?

Edit: I am trying this, but would like to make sure it is valid:

    double best_value = 0;
    int best
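If OpenMP 4.0 is available, one option is a user-declared reduction over a (value, index) pair, so the index travels with the maximum through the reduction itself. A sketch; `toOptimize` here is a hypothetical stand-in for the question's objective, and with OpenMP disabled the pragmas are ignored and the loop runs serially with the same result:

```cpp
struct Best { double value; int index; };

// Combiner for the declared reduction: keep the pair with the larger value.
Best better(Best a, Best b) { return a.value >= b.value ? a : b; }

#pragma omp declare reduction(argmax : Best : omp_out = better(omp_out, omp_in)) \
    initializer(omp_priv = Best{-1.0e300, -1})

// Hypothetical objective standing in for the question's toOptimize():
// peaks at val == 5.
double toOptimize(int val) { return -(val - 5.0) * (val - 5.0); }

Best argmaxToOptimize(int lo, int hi) {
    Best best{-1.0e300, -1};
    #pragma omp parallel for reduction(argmax : best)
    for (int i = lo; i < hi; ++i) {
        double v = toOptimize(i);
        if (v > best.value) best = Best{v, i};
    }
    return best;
}
```

Note the identity element must be below every possible objective value; initializing to 0, as in the question, silently discards maxima when the objective is everywhere negative.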

Finding max value in CUDA

混江龙づ霸主 submitted on 2019-12-10 10:03:31

Question: I am trying to write code in CUDA for finding the max value of a given set of numbers. Assume you have 20 numbers and the kernel runs on 2 blocks of 5 threads. Now assume the 10 threads compare the first 10 values at the same time, and thread 2 finds a max value, so thread 2 updates the max value variable in global memory. While thread 2 is updating, what will happen to the remaining threads (1, 3-10) that will be comparing using the old value? If I lock the global variable
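The race being described is exactly what atomic operations exist for: CUDA provides `atomicMax` for integers, and a thread that compared against a stale value simply retries or loses harmlessly, with no lock needed. The guarantee can be mimicked on the host with `std::atomic` and a compare-exchange loop:

```cpp
#include <atomic>

// Host-side analogue of CUDA's atomicMax(&g_max, value): the update is an
// indivisible read-modify-write. If another thread stores a larger value
// first, compare_exchange_weak reloads `cur` and the loop either retries
// or exits because our value is no longer the larger one.
void atomicFetchMax(std::atomic<int>& target, int value) {
    int cur = target.load();
    while (value > cur && !target.compare_exchange_weak(cur, value)) {
        // cur now holds the freshly observed value; loop re-tests it.
    }
}
```

In practice a kernel would not hammer one global variable from every thread: the usual pattern is a per-block shared-memory reduction followed by a single `atomicMax` per block.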

Block reduction in CUDA

穿精又带淫゛_ submitted on 2019-12-10 02:45:16

Question: I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA. I guess I am really not sure how to set up the block size and grid size, especially when my input array (512 x 512) is larger than a single block size. Here is the code:

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
    {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) +
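The indexing above is the key to sizing: each block of this kernel starts at `blockIdx.x * blockSize * 2`, i.e. covers `2 * blockSize` elements before grid-striding. So for n = 512 * 512 with blockSize = 256, one pass needs ceil(n / 512) = 512 blocks, producing 512 partial sums that a second one-block pass reduces. A small helper capturing that arithmetic (illustrative, not from the NVIDIA sample):

```cpp
// Number of blocks for one pass of a reduce6-style kernel, where each
// block consumes 2 * blockSize elements: ceil(n / (2 * blockSize)).
// Each pass shrinks the problem from n elements to this many partial
// sums; repeat until one value remains.
int blocksForPass(unsigned n, unsigned blockSize) {
    return (n + blockSize * 2 - 1) / (blockSize * 2);
}
```

Shared memory must also be sized at launch: `blockSize * sizeof(int)` bytes per block for `sdata`.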

CUDA - why is warp based parallel reduction slower?

元气小坏坏 submitted on 2019-12-09 09:18:36

Question: I had the idea of a warp-based parallel reduction, since all threads of a warp are in sync by definition. So the idea was that the input data could be reduced by a factor of 64 (each thread reduces two elements) without any need for synchronization. As in the original implementation by Mark Harris, the reduction is applied at block level, with the data in shared memory. http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf I created a kernel to test his version and my warp-based version.
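The factor-64 step can be pictured per warp: 32 lanes each add one pair, then the stride-halving loop runs with all lanes moving in lockstep, which is the property the warp-based version relies on instead of `__syncthreads()`. A host-side simulation of one warp reducing 64 elements (plain C++, not the questioner's kernel):

```cpp
#include <vector>

// Simulates one 32-lane warp reducing 64 elements: each lane first adds
// its pair (the factor-64 step), then the halving loop runs. The inner
// loop stands in for lanes executing simultaneously in hardware; on a
// real warp this lockstep is what removes the need for barriers.
int warpReduce64(const std::vector<int>& in) {   // requires in.size() == 64
    std::vector<int> lane(32);
    for (int t = 0; t < 32; ++t)
        lane[t] = in[t] + in[t + 32];
    for (int offset = 16; offset > 0; offset /= 2)
        for (int t = 0; t < offset; ++t)
            lane[t] += lane[t + offset];
    return lane[0];   // lane 0 holds the warp's sum
}
```

Note that on hardware after compute capability 7.x, lockstep within a warp is no longer guaranteed implicitly; modern code uses `__syncwarp()` or `__shfl_down_sync` for the same pattern.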