I am confused by __shfl_down and __shfl_down_sync , they give differetn resutls.
__global__ void shufledown1(double* a, double *b,double *c, int N) { double temp = 2.0; __syncthreads(); for (int offset = 32/2; offset > 0; offset /= 2){ temp+=__shfl_down(temp, offset,32); } printf("%d %f %d \n",threadIdx.x ,temp,blockDim.x * gridDim.x); } __global__ void shufledown2(double* a, double *b,double *c, int N) { double temp = 2.0; __syncthreads(); for (int offset = 32/2; offset > 0; offset /= 2){ temp+=__shfl_down_sync(temp, offset,32) } printf("%d %f %d \n",threadIdx.x ,temp,blockDim.x * gridDim.x); }
The first one gave:
0 64.000000 64 ''''' '''' '''' 63 64.000000 64
The second one gave:
0 33.000000 64 ''''' '''' '''' 63 33.000000 64
The kernel was run with 1 block 64 threads. Regards
In addition to just a different name, the _sync
versions of the warp shuffle functions also have a different prototype, as indicated in the documentation. The first parameter is a mask parameter.
You appear to be trying to use both functions in the same way:
temp+=__shfl_down(temp, offset,32); temp+=__shfl_down_sync(temp, offset,32);
but that is incorrect. To use the _sync
version in an analogous fashion, you should do:
temp+=__shfl_down_sync(0xFFFFFFFF, temp, offset,32);
When I make that change, your code runs correctly for me:
#include <stdio.h> __global__ void shufledown1(double* a, double *b,double *c, int N) { double temp = 2.0; __syncthreads(); for (int offset = 32/2; offset > 0; offset /= 2){ temp+=__shfl_down(temp, offset,32); } printf("%d %f %d \n",threadIdx.x ,temp,blockDim.x * gridDim.x); } __global__ void shufledown2(double* a, double *b,double *c, int N) { double temp = 2.0; __syncthreads(); for (int offset = 32/2; offset > 0; offset /= 2){ temp+=__shfl_down_sync(0xFFFFFFFF, temp, offset,32); } printf("%d %f %d \n",threadIdx.x ,temp,blockDim.x * gridDim.x); } int main(){ double *a = NULL, *b = NULL, *c = NULL; shufledown1<<<1,64>>>(a, b, c, 0); cudaDeviceSynchronize(); shufledown2<<<1,64>>>(a, b, c, 0); cudaDeviceSynchronize(); } $ nvcc -arch=sm_60 -o t1358 t1358.cu t1358.cu(9): warning: function "__shfl_down(double, unsigned int, int)" /usr/local/cuda/bin/..//include/sm_30_intrinsics.hpp(453): here was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).") ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 49; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 52; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 63; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 66; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 77; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 80; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 91; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 94; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 105; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version ptxas /tmp/tmpxft_000045b6_00000000-5_t1358.ptx, line 108; warning : Instruction 'shfl' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version $ ./t1358 0 64.000000 64 1 64.000000 64 2 64.000000 64 3 64.000000 64 4 64.000000 64 5 64.000000 64 6 64.000000 64 7 64.000000 64 8 64.000000 64 9 64.000000 64 10 64.000000 64 11 64.000000 64 12 64.000000 64 13 64.000000 64 14 64.000000 64 15 64.000000 64 16 64.000000 64 17 64.000000 64 18 64.000000 64 19 64.000000 64 20 64.000000 64 21 64.000000 64 22 64.000000 64 23 64.000000 64 24 64.000000 64 25 64.000000 64 26 64.000000 64 27 64.000000 64 28 64.000000 64 29 64.000000 64 30 64.000000 64 31 64.000000 64 32 64.000000 64 33 64.000000 64 34 64.000000 64 35 64.000000 64 36 64.000000 64 37 64.000000 64 38 64.000000 64 39 64.000000 64 40 64.000000 64 41 64.000000 64 42 64.000000 64 43 64.000000 64 44 64.000000 64 45 64.000000 64 46 64.000000 64 47 64.000000 64 48 64.000000 64 49 64.000000 64 50 64.000000 64 51 64.000000 64 52 64.000000 64 53 64.000000 64 54 64.000000 64 55 64.000000 64 56 64.000000 64 57 64.000000 64 58 64.000000 64 59 64.000000 64 60 64.000000 64 61 64.000000 64 62 64.000000 64 63 64.000000 64 0 64.000000 64 1 64.000000 64 2 64.000000 64 3 64.000000 64 4 64.000000 64 5 64.000000 64 6 64.000000 64 7 64.000000 64 8 64.000000 64 9 64.000000 64 10 64.000000 64 11 64.000000 64 12 64.000000 64 13 64.000000 64 14 64.000000 64 15 64.000000 64 16 64.000000 64 17 64.000000 64 18 64.000000 64 19 64.000000 64 20 64.000000 64 21 64.000000 64 22 64.000000 64 23 64.000000 64 24 64.000000 64 25 64.000000 64 26 64.000000 64 27 64.000000 64 28 64.000000 64 29 64.000000 64 30 64.000000 64 31 64.000000 64 32 64.000000 64 33 64.000000 64 34 64.000000 64 35 64.000000 64 36 64.000000 64 37 64.000000 64 38 64.000000 64 39 64.000000 64 40 64.000000 64 41 64.000000 64 42 64.000000 64 43 64.000000 64 44 64.000000 64 45 64.000000 64 46 64.000000 64 47 64.000000 64 48 64.000000 64 49 64.000000 64 50 64.000000 64 51 64.000000 64 52 64.000000 64 53 64.000000 64 54 64.000000 64 55 64.000000 64 56 64.000000 64 57 64.000000 64 58 64.000000 64 59 64.000000 64 60 64.000000 64 61 64.000000 64 62 64.000000 64 63 64.000000 64 $
For new code or new maintenance, you should only use the _sync
versions.
For further examples of the usage of the mask parameter, refer to this blog