I have implemented a 2D median filter in CUDA and the whole program is shown below.
#include "cuda_runtime.h"
#include "cuda_runtime_api.h"
#include "de
It seems your threads do not share any data through shared memory. For a 3x3 filter, that means each pixel is read 9 times from global memory, which is unnecessary. This white paper may give you some ideas on how to use shared memory in a convolution kernel. Hope it helps.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
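
To make the idea concrete, here is a minimal sketch of a 3x3 median kernel that stages each block's pixels in shared memory first. It is not your code: the names `median3x3Shared`, `median3x3` and `TILE` are made up for illustration, and I am assuming an 8-bit single-channel image, a 16x16 thread block, and clamp-to-edge border handling. A median is not separable, so only the tile-plus-halo loading pattern from the white paper carries over, not the row/column split.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed block edge; the kernel must be launched with TILE x TILE threads

// Hypothetical kernel: 3x3 median using a shared-memory tile with a 1-pixel halo.
__global__ void median3x3Shared(const unsigned char* in, unsigned char* out,
                                int width, int height)
{
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    // Image coordinates of tile[0][0] (one pixel up and left of the block's first output pixel).
    const int x0 = blockIdx.x * TILE - 1;
    const int y0 = blockIdx.y * TILE - 1;

    // Cooperative load: the block's threads stride over all (TILE+2)^2 tile entries,
    // so each input pixel is fetched from global memory once per block.
    const int tid = threadIdx.y * TILE + threadIdx.x;
    for (int i = tid; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        const int ty = i / (TILE + 2);
        const int tx = i % (TILE + 2);
        const int gx = min(max(x0 + tx, 0), width - 1);   // clamp to edge (assumption)
        const int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();  // every thread reaches this barrier before any early exit

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // Gather the 3x3 window from shared memory instead of global memory.
    unsigned char v[9];
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            v[dy * 3 + dx] = tile[threadIdx.y + dy][threadIdx.x + dx];

    // Partial selection sort: after 5 passes v[0..4] are the five smallest, so v[4] is the median.
    for (int i = 0; i < 5; ++i) {
        int m = i;
        for (int j = i + 1; j < 9; ++j)
            if (v[j] < v[m]) m = j;
        const unsigned char t = v[i]; v[i] = v[m]; v[m] = t;
    }
    out[y * width + x] = v[4];
}

// Hypothetical host wrapper; error checking is omitted to keep the sketch short.
void median3x3(const unsigned char* hIn, unsigned char* hOut, int width, int height)
{
    const size_t bytes = (size_t)width * height;
    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    median3x3Shared<<<grid, block>>>(dIn, dOut, width, height);

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel has finished
    cudaFree(dIn);
    cudaFree(dOut);
}
```

With this layout each block reads its (TILE+2) x (TILE+2) region from global memory once, and the nine reads per output pixel hit shared memory instead. Whether that is a measurable win on your GPU depends on how well the cache already absorbs the redundant reads, so it is worth profiling both versions.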