reduction

Reduction of matrix rows in OpenCL

半腔热情 submitted on 2019-12-11 06:28:55

Question: I have a matrix stored as a 1D array on the GPU, and I am trying to write an OpenCL kernel that applies a reduction to every row of this matrix. For example, if my matrix is 2x3 with the elements [1, 2, 3, 4, 5, 6], what I want is:

    [1, 2, 3] = [ 6]
    [4, 5, 6]   [15]

Obviously, since I am talking about reduction, the actual result could be more than one element per row:

    [1, 2, 3] = [3, 3]
    [4, 5, 6]   [9, 6]

Then I can do the final calculation in another kernel or on the CPU.
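For reference, the per-row reduction the question asks for can be sketched on the host in plain C++ (not actual OpenCL; the function name is illustrative). Each OpenCL work-group would typically compute one row's sum this way:

```cpp
#include <cassert>
#include <vector>

// Reduce each row of a row-major rows x cols matrix stored as a 1D array.
// This is the host-side reference result that a per-row OpenCL
// work-group reduction should reproduce.
std::vector<int> reduceRows(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> out(rows, 0);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            out[r] += m[r * cols + c];   // row r occupies m[r*cols .. r*cols+cols-1]
    return out;
}
```

On the device, a natural mapping is one work-group per row with a local-memory tree reduction inside the group; the partial-sums variant in the question simply stops that tree early.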

Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA

点点圈 submitted on 2019-12-11 04:25:34

Question: I am looking for a fast way to reduce multiple blocks of equal length that are arranged as one big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size K, so the size of the whole array is N*K. What I am doing is to call the kernel N times; each time it computes the reduction of one subarray, iterating over all the subarrays contained in the big vector:

    for (i = 0; i < N; i++) {
        thrust::device_vector< float > Vec
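Rather than N separate launches, the whole batch can be reduced in a single pass (in Thrust this is what `reduce_by_key` with segment keys `i / K` does). A host-side sketch of the segmented result, with illustrative names:

```cpp
#include <vector>

// One pass over the big vector produces one sum per length-K segment --
// the same result a single batched launch (or thrust::reduce_by_key with
// key i / K) would give, instead of N kernel calls.
std::vector<float> reduceSegments(const std::vector<float>& big, int N, int K) {
    std::vector<float> out(N, 0.0f);
    for (int s = 0; s < N; ++s)          // segment index
        for (int j = 0; j < K; ++j)      // element within the segment
            out[s] += big[s * K + j];
    return out;
}
```

On the GPU the usual layout is one block per segment, so all N reductions run concurrently in one launch.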

Haskell: Alternative, non-circular definition of Redex?

江枫思渺然 submitted on 2019-12-10 22:35:59

Question: I got quite confused about what is and is not a redex in Haskell, so I spent some time on it, but I would like feedback on whether I got it right. I found this definition of a redex, and it is circular:

    Etymology: From "reducible expression"
    Definition: Redex (plural redexes): (mathematics) Something to be reduced
    according to the rules of a formal system.

http://en.wiktionary.org/wiki/redex

The above definition presumes one already knows how to reduce. So to me this is like saying "Bluish is the

Struggling with intuition regarding how warp-synchronous thread execution works

别说谁变了你拦得住时间么 submitted on 2019-12-10 20:48:22

Question: I am new to CUDA. I am working on basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:

    __global__ void Reduction2_kernel( int *out, const int *in, size_t N )
    {
        extern __shared__ int sPartials[];
        int sum = 0;
        const int tid = threadIdx.x;
        for ( size_t i = blockIdx.x*blockDim.x + tid; i < N; i += blockDim.x*gridDim.x ) {
            sum += in[i];
        }
        sPartials[tid] = sum;
        __syncthreads();
        for ( int activeThreads = blockDim.x>>1;
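The loop that is cut off above is the standard shared-memory tree reduction: at each step the first `activeThreads` threads each add in the partial of their partner `activeThreads` positions away, with a `__syncthreads()` between steps. A host-side C++ sketch of that halving loop (assuming, as the kernel does, a power-of-two block size):

```cpp
#include <vector>

// Mimics the shared-memory tree reduction over sPartials: at each step
// the first `active` slots accumulate their partner `active` positions
// away -- on the GPU, each inner iteration is one thread, and a
// __syncthreads() separates the outer steps.
int treeReduce(std::vector<int> partials) {   // size must be a power of two
    for (std::size_t active = partials.size() / 2; active > 0; active /= 2)
        for (std::size_t tid = 0; tid < active; ++tid)
            partials[tid] += partials[tid + active];
    return partials[0];   // thread 0 ends up holding the block's sum
}
```

After the loop, the kernel would have thread 0 write `sPartials[0]` to `out[blockIdx.x]`.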

omp max reduction with storage of index

自闭症网瘾萝莉.ら submitted on 2019-12-10 19:45:36

Question: Using C++ and OpenMP 3.1, I implemented a max reduction which stores the maximum value of an integer variable (score) over a vector of objects (s). But I also want to store the vector index, to access the s object with the maximum score. My current unsuccessful implementation looks like this:

    // s is a vector of sol objects which contain, apart from other
    // variables, an integer score variable s[i].score
    int bestscore = 0;
    int bestant = 0;
    #pragma omp parallel shared(bestant)
    { // start parallel session
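OpenMP 3.1's `reduction(max: ...)` cannot carry the index along with the value. A common workaround is a thread-local best plus one critical-section merge per thread. A sketch under that pattern (names are illustrative; with OpenMP disabled the pragmas are ignored and the loop runs serially with the same result):

```cpp
#include <vector>

struct Sol { int score; };   // stand-in for the question's sol objects

// Returns the index of the highest-scoring element. Each thread tracks a
// local best over its chunk; the shared best is updated only once per
// thread, inside the critical section, so contention stays low.
int argmaxScore(const std::vector<Sol>& s) {
    int bestscore = -1, bestant = -1;
    #pragma omp parallel
    {
        int localScore = -1, localAnt = -1;
        #pragma omp for nowait
        for (int i = 0; i < (int)s.size(); ++i)
            if (s[i].score > localScore) { localScore = s[i].score; localAnt = i; }
        #pragma omp critical
        if (localScore > bestscore) { bestscore = localScore; bestant = localAnt; }
    }
    return bestant;
}
```

Note that when several elements tie for the maximum, which index wins depends on thread merge order.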

Flow Shop to Boolean satisfiability [Polynomial-time reduction]

て烟熏妆下的殇ゞ submitted on 2019-12-10 13:43:54

Question: I am contacting you to get an idea of how to transform a flow shop scheduling problem into Boolean satisfiability. I have already done such reductions for an N*N Sudoku, the N-queens problem, and a class scheduling problem, but I have some issues with how to transform the flow shop into SAT. A SAT problem looks like this: the goal is, given a set of Boolean variables, to find an assignment of every variable that makes the "sentence" true (if finding a solution is possible). I created my own solver
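One standard approach (analogous to the Sudoku encoding the question mentions) is a time-indexed encoding: a Boolean variable x(j, m, t) meaning "job j starts on machine m at time t" over a horizon of T slots, with "exactly one start time" clauses per (job, machine). A sketch of that clause generation in DIMACS-style integer literals; the variable layout and names are my own, and a full reduction would also need machine-conflict and job-precedence clauses:

```cpp
#include <vector>

// Time-indexed encoding sketch: var(j, m, t) is true iff job j starts on
// machine m at time slot t. Emits, per (job, machine): one "at least one
// start time" clause and pairwise "at most one start time" clauses.
struct Cnf {
    int J, M, T;
    std::vector<std::vector<int>> clauses;   // each clause: DIMACS literals
    int var(int j, int m, int t) const { return 1 + (j * M + m) * T + t; }
};

Cnf encodeStartTimes(int J, int M, int T) {
    Cnf f{J, M, T, {}};
    for (int j = 0; j < J; ++j)
        for (int m = 0; m < M; ++m) {
            std::vector<int> atLeastOne;
            for (int t = 0; t < T; ++t)
                atLeastOne.push_back(f.var(j, m, t));
            f.clauses.push_back(atLeastOne);        // some start time exists
            for (int t1 = 0; t1 < T; ++t1)          // no two start times
                for (int t2 = t1 + 1; t2 < T; ++t2)
                    f.clauses.push_back({-f.var(j, m, t1), -f.var(j, m, t2)});
        }
    return f;
}
```

Each (job, machine) pair contributes 1 + T(T-1)/2 clauses, so the encoding grows quadratically in the horizon; to decide optimality one typically solves repeatedly while shrinking T.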

How to implement argmax with OpenMP?

南笙酒味 submitted on 2019-12-10 11:44:16

Question: I am trying to implement an argmax with OpenMP. In short, I have a function that computes a floating point value:

    double toOptimize(int val);

I can get the integer maximizing the value with:

    double best = 0;
    #pragma omp parallel for reduction(max: best)
    for(int i = 2; i < MAX; ++i) {
        double v = toOptimize(i);
        if(v > best) best = v;
    }

Now, how can I get the value i corresponding to the maximum?

Edit: I am trying this, but would like to make sure it is valid:

    double best_value = 0;
    int best
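If OpenMP 4.0 is available, one option is a user-declared reduction over a (value, index) pair, so the index travels with the maximum through the reduction itself. A sketch; `toOptimize` here is a hypothetical stand-in for the question's objective, and with OpenMP disabled the pragmas are ignored and the loop runs serially with the same result:

```cpp
struct Best { double value; int index; };

// Combiner for the declared reduction: keep the pair with the larger value.
Best better(Best a, Best b) { return a.value >= b.value ? a : b; }

#pragma omp declare reduction(argmax : Best : omp_out = better(omp_out, omp_in)) \
    initializer(omp_priv = Best{-1.0e300, -1})

// Hypothetical objective standing in for the question's toOptimize():
// peaks at val == 5.
double toOptimize(int val) { return -(val - 5.0) * (val - 5.0); }

Best argmaxToOptimize(int lo, int hi) {
    Best best{-1.0e300, -1};
    #pragma omp parallel for reduction(argmax : best)
    for (int i = lo; i < hi; ++i) {
        double v = toOptimize(i);
        if (v > best.value) best = Best{v, i};
    }
    return best;
}
```

Note the identity element must be below every possible objective value; initializing to 0, as in the question, silently discards maxima when the objective is everywhere negative.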

Finding max value in CUDA

混江龙づ霸主 submitted on 2019-12-10 10:03:31

Question: I am trying to write code in CUDA for finding the max value of a given set of numbers. Assume you have 20 numbers and the kernel runs on 2 blocks of 5 threads. Now assume the 10 threads compare the first 10 values at the same time, and thread 2 finds a max value, so thread 2 updates the max value variable in global memory. While thread 2 is updating, what will happen to the remaining threads (1, 3-10) that will be comparing using the old value? If I lock the global variable
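The race being described is exactly what atomic operations exist for: CUDA provides `atomicMax` for integers, and a thread that compared against a stale value simply retries or loses harmlessly, with no lock needed. The guarantee can be mimicked on the host with `std::atomic` and a compare-exchange loop:

```cpp
#include <atomic>

// Host-side analogue of CUDA's atomicMax(&g_max, value): the update is an
// indivisible read-modify-write. If another thread stores a larger value
// first, compare_exchange_weak reloads `cur` and the loop either retries
// or exits because our value is no longer the larger one.
void atomicFetchMax(std::atomic<int>& target, int value) {
    int cur = target.load();
    while (value > cur && !target.compare_exchange_weak(cur, value)) {
        // cur now holds the freshly observed value; loop re-tests it.
    }
}
```

In practice a kernel would not hammer one global variable from every thread: the usual pattern is a per-block shared-memory reduction followed by a single `atomicMax` per block.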

Block reduction in CUDA

穿精又带淫゛_ submitted on 2019-12-10 02:45:16

Question: I am trying to do reduction in CUDA and I am really a newbie. I am currently studying a sample code from NVIDIA. I guess I am really not sure how to set up the block size and grid size, especially when my input array (512 x 512) is larger than a single block size. Here is the code:

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
    {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) +
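The indexing above is the key to sizing: each block of this kernel starts at `blockIdx.x * blockSize * 2`, i.e. covers `2 * blockSize` elements before grid-striding. So for n = 512 * 512 with blockSize = 256, one pass needs ceil(n / 512) = 512 blocks, producing 512 partial sums that a second one-block pass reduces. A small helper capturing that arithmetic (illustrative, not from the NVIDIA sample):

```cpp
// Number of blocks for one pass of a reduce6-style kernel, where each
// block consumes 2 * blockSize elements: ceil(n / (2 * blockSize)).
// Each pass shrinks the problem from n elements to this many partial
// sums; repeat until one value remains.
int blocksForPass(unsigned n, unsigned blockSize) {
    return (n + blockSize * 2 - 1) / (blockSize * 2);
}
```

Shared memory must also be sized at launch: `blockSize * sizeof(int)` bytes per block for `sdata`.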

CUDA - why is warp based parallel reduction slower?

元气小坏坏 submitted on 2019-12-09 09:18:36

Question: I had the idea of a warp-based parallel reduction, since all threads of a warp are in sync by definition. So the idea was that the input data could be reduced by a factor of 64 (each thread reduces two elements) without any need for synchronization. As in the original implementation by Mark Harris, the reduction is applied at block level, with the data in shared memory. http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf I created a kernel to test his version and my warp-based version.
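The factor-64 step can be pictured per warp: 32 lanes each add one pair, then the stride-halving loop runs with all lanes moving in lockstep, which is the property the warp-based version relies on instead of `__syncthreads()`. A host-side simulation of one warp reducing 64 elements (plain C++, not the questioner's kernel):

```cpp
#include <vector>

// Simulates one 32-lane warp reducing 64 elements: each lane first adds
// its pair (the factor-64 step), then the halving loop runs. The inner
// loop stands in for lanes executing simultaneously in hardware; on a
// real warp this lockstep is what removes the need for barriers.
int warpReduce64(const std::vector<int>& in) {   // requires in.size() == 64
    std::vector<int> lane(32);
    for (int t = 0; t < 32; ++t)
        lane[t] = in[t] + in[t + 32];
    for (int offset = 16; offset > 0; offset /= 2)
        for (int t = 0; t < offset; ++t)
            lane[t] += lane[t + offset];
    return lane[0];   // lane 0 holds the warp's sum
}
```

Note that on hardware after compute capability 7.x, lockstep within a warp is no longer guaranteed implicitly; modern code uses `__syncwarp()` or `__shfl_down_sync` for the same pattern.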