large integer addition with CUDA
I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large integer addition. Large integers are represented in the usual way as an array of 32-bit words. For example, we can use one thread to add two 32-bit words. For simplicity, let's assume that the numbers to be added are of the same length and that the number of threads per block == the number of words. Then:

    __global__ void add_kernel(int *C, const int *A, const int *B) {
        int x = A[threadIdx.x];
        int y = B[threadIdx.x];
        C[threadIdx.x] = x + y;  // per-word sum; carries between words are not handled here
    }
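For reference, here is a minimal sketch of how this kernel might be driven from the host under the stated assumptions (one block, one thread per 32-bit word). The word count n, the test values, and the main driver are illustrative assumptions, not part of the original setup:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void add_kernel(int *C, const int *A, const int *B) {
        int x = A[threadIdx.x];
        int y = B[threadIdx.x];
        C[threadIdx.x] = x + y;  // per-word sum only; inter-word carries are dropped
    }

    int main() {
        const int n = 8;  // assumed number of 32-bit words per operand (for the demo)
        int hA[n], hB[n], hC[n];
        for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2 * i; }  // illustrative values

        int *dA, *dB, *dC;
        cudaMalloc(&dA, n * sizeof(int));
        cudaMalloc(&dB, n * sizeof(int));
        cudaMalloc(&dC, n * sizeof(int));
        cudaMemcpy(dA, hA, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, n * sizeof(int), cudaMemcpyHostToDevice);

        // One block, one thread per word, matching the assumption above.
        add_kernel<<<1, n>>>(dC, dA, dB);

        cudaMemcpy(hC, dC, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%d ", hC[i]);
        printf("\n");

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

Note that, as written, each thread computes its word independently, so a carry out of word i is simply lost; propagating those carries across threads is the part that still needs an algorithm.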