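The issue is that you defined a __device__ function in a separate compilation unit from the __global__ function that calls it. By default, nvcc compiles device code in whole-program mode, so a kernel cannot call a device function defined in another translation unit. You need to either explicitly enable relocatable device code by compiling with the -dc flag, or move the definition into the same unit as the kernel.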
From the nvcc documentation:
--device-c|-dc Compile each .c/.cc/.cpp/.cxx/.cu input file into an object file that contains relocatable device code. It is equivalent to --relocatable-device-code=true --compile.
See Separate Compilation and Linking of CUDA C++ Device Code for more information.
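For illustration, here is a minimal sketch of the separate-compilation layout. The file names (square.cu, main.cu), the function square, and the kernel squareAll are hypothetical and not taken from your code; the two files are shown in one listing, separated by comments. The build commands in the trailing comment assume Linux-style .o object names and that nvcc is also used for the final link, so that it performs the device link step for you.

```cuda
// ---- square.cu (hypothetical file name) ----
// The __device__ function is defined in its own translation unit.
__device__ float square(float x) { return x * x; }

// ---- main.cu (hypothetical file name) ----
#include <cstdio>
#include <cuda_runtime.h>

// Declaration only; the definition in square.cu is resolved at device-link time.
extern __device__ float square(float x);

// Kernel that calls the device function defined in the other unit.
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

int main() {
    const int n = 4;
    float host[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float *dev = nullptr;

    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    squareAll<<<1, n>>>(dev, n);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < n; ++i) printf("%g\n", host[i]);
    return 0;
}

// Build (assumed commands for this sketch):
//   nvcc -dc square.cu main.cu      # produces square.o and main.o with relocatable device code
//   nvcc square.o main.o -o app     # nvcc device-links and host-links the objects
```

If you link with the host compiler instead of nvcc, you would need a separate device link step first (nvcc -dlink); the alternative that avoids all of this is to put the definition of square in main.cu next to the kernel.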