reduction

CUDA: reduction or atomic operations?

Submitted by 我是研究僧i on 2019-12-19 09:24:12
Question: I'm writing a CUDA kernel that involves computing the maximum value of a given matrix, and I'm evaluating the possibilities. The best approach I could find is: force every thread to store a value in shared memory and then use a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices). I couldn't use atomic operations because there are both a read and a write operation involved, so threads could not be synchronized by …

OpenCL float sum reduction

Submitted by ▼魔方 西西 on 2019-12-19 07:36:27
Question: I would like to apply a reduction to this piece of my kernel code (1-dimensional data): __local float sum = 0; int i; for(i = 0; i < length; i++) sum += //some operation depending on i here; Instead of having just one thread perform this operation, I would like to have n threads (with n = length), and at the end have one thread compute the total sum. In pseudocode, I would like to be able to write something like this: int i = get_global_id(0); __local float sum = 0; sum += //some operation …

OpenMP and reduction on std::vector?

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-17 16:38:09
Question: I want to make this code parallel: std::vector<float> res(n,0); std::vector<float> vals(m); std::vector<float> indexes(m); // fill indexes with values in range [0,n) // fill vals and indexes for(size_t i=0; i<m; i++){ res[indexes[i]] += //something using vals[i]; } In this article it is suggested to use: #pragma omp parallel for reduction(+:myArray[:6]) In this question the same approach is proposed in the comments. I have two questions: I don't know m at compile time, and from these …

OpenCL - using atomic reduction for double

Submitted by 泄露秘密 on 2019-12-14 02:29:17
Question: I know atomic functions are not recommended with OpenCL 1.x, but I just want to understand an atomic example. The following kernel code does not work correctly; it produces random final values when computing the sum of all array values (a sum reduction): #pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable void atom_add_double(volatile __local double *val, double delta) { union { double f; ulong i; } old, new; do { old.f = *val; new.f = old.f + delta; } while (atom_cmpxchg((volatile _ …

MPI_Reduce doesn't work as expected

Submitted by 牧云@^-^@ on 2019-12-13 15:25:52
Question: I am very new to MPI and I'm trying to use MPI_Reduce to find the maximum of an integer array. I have an integer array arr of size arraysize, and here is my code: MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes); MPI_Comm_rank(MPI_COMM_WORLD, &my_process_id); MPI_Bcast(arr, arraysize, MPI_INT, 0, MPI_COMM_WORLD); MPI_Reduce(arr, &result, arraysize, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD); if(!my_process_id){ printf("%d", result); } MPI_Finalize(); My program compiles …

Cumulative sum in two dimensions on array in nested loop — CUDA implementation?

Submitted by 我们两清 on 2019-12-12 18:38:42
Question: I have been thinking about how to perform this operation on CUDA using reductions, but I'm at a bit of a loss as to how to accomplish it. The C code is below. The important part to keep in mind: the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m; e.g., m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included the sizes of the loop iterators in case it helps to provide a better implementation …

CUDA kernel with reduction: logic errors in the dot product of 2 matrices

Submitted by 二次信任 on 2019-12-12 04:49:57
Question: I am just starting out with CUDA and am trying to wrap my head around the CUDA reduction algorithm. In my case, I have been trying to compute the dot product of two matrices, but I am getting the right answer only for matrices of size 2; for any other matrix size I get it wrong. This is only a test, so I am keeping the matrix size very small (only about 100, so a single block fits it all). Any help would be greatly appreciated. Thanks! Here is the regular code: float* ha = new float[n]; // …

L = {<T> | T is a Turing machine that recognizes {00, 01}}: prove L is undecidable

Submitted by 岁酱吖の on 2019-12-12 04:44:37
Question: L = {<T> | T is a Turing machine that recognizes {00, 01}}. Prove L is undecidable. I am really having difficulty even understanding which reduction to use here. I'm not asking for a free lunch, just a push in the right direction. Answer 1: A direct application of Rice's theorem will let you prove this without doing any work at all. Some Turing machines recognize {00, 01}; some don't. The difference is semantic, in that it concerns the strings accepted rather than the structure of the automaton. Hence, …

OpenCL: parallel reduction without local memory

Submitted by 不羁的心 on 2019-12-11 11:12:47
Question: Most parallel reduction algorithms use shared (local) memory: Nvidia, AMD, Intel, and so on. But what if a device has no shared (local) memory? How can I do it? If I use the same algorithms but store the temporary values in global memory, will it work fine? Answer 1: If I think about it, my comment already was the complete answer. Yes, you can use global memory as a replacement for local memory, but: you have to allocate enough global memory for all workgroups and assign the workgroups …