How to implement nested loops in CUDA Thrust
Question

I currently have to run a nested loop as follows:

    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j <= N; j++) {
            compute(...); // some calculation here
        }
    }

I've tried keeping the outer loop on the CPU and running the inner loop on the GPU, but this results in too many memory accesses. Is there another way to do it, for example with thrust::reduce_by_key?

The whole program is here:

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust
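
One possible direction, offered only as a sketch: if each call to compute(i, j) is independent of the others (an assumption, since the full program is not shown), the triangular loop over 0 <= i < j <= N can be flattened into a single index k in [0, N*(N+1)/2), so that one parallel pass covers every pair. The names `pair_count`, `pair_from_index`, and `for_each_pair` below are hypothetical helpers, not part of Thrust, and the pairs are visited in a different order than in the original loop. The code is plain host-side C++ so it can be checked without a GPU.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Number of (i, j) pairs with 0 <= i < j <= N, i.e. the trip count of
// the original nested loop.
inline std::int64_t pair_count(std::int64_t N) {
    return N * (N + 1) / 2;
}

// Map a flat index k (0 <= k < pair_count(N)) to a pair (i, j) with
// 0 <= i < j. Pairs are grouped by j: the group for j starts at offset
// j*(j-1)/2 and holds j entries (i = 0 .. j-1).
inline std::pair<std::int64_t, std::int64_t> pair_from_index(std::int64_t k) {
    // Closed-form estimate of j from the quadratic j*(j-1)/2 <= k.
    std::int64_t j = static_cast<std::int64_t>(
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(k))) / 2.0);
    // Correct any floating-point rounding error in either direction.
    while (j * (j - 1) / 2 > k) --j;
    while ((j + 1) * j / 2 <= k) ++j;
    std::int64_t i = k - j * (j - 1) / 2;
    return {i, j};
}

// Sequential stand-in for a parallel launch: one flat loop replaces the
// two nested loops and calls f(i, j) exactly once per pair.
template <class F>
void for_each_pair(std::int64_t N, F f) {
    const std::int64_t total = pair_count(N);
    for (std::int64_t k = 0; k < total; ++k) {
        const std::pair<std::int64_t, std::int64_t> p = pair_from_index(k);
        f(p.first, p.second);
    }
}
```

On the GPU, the same mapping could presumably be marked `__host__ __device__` and called from a functor passed to thrust::for_each or thrust::transform over a thrust::counting_iterator range of length pair_count(N). Whether that reduces memory traffic depends on what compute reads and writes, which the truncated program above does not show.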