Question
My intended program flow would look like the following if it were possible:
typedef struct structure_t
{
    [...]
    /* device function pointer. */
    __device__ float (*function_pointer)(float, float, float[]);
    [...]
} structure;

[...]

/* function to be assigned. */
__device__ float
my_function (float a, float b, float c[])
{
    /* do some stuff on the device. */
    [...]
}

void
some_structure_initialization_function (structure *st)
{
    /* assign. */
    st->function_pointer = my_function;
    [...]
}
This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure.
error: attribute "device" does not apply here
There are some examples of similar problems here on Stack Overflow, but they all involve the use of static pointers outside the structure. Examples are "device function pointers as struct members" and "device function pointers". I have taken a similar approach successfully in other codes, where it is easy to use static device pointers and define them outside of any structures. Here, though, that is a problem: the code is written as an API of sorts, and the user may define one, two, or dozens of structures that each need to include a device function pointer. Defining static device pointers outside of the structure is therefore a major problem.
I'm fairly certain the answer exists within the posts I have linked above, through the use of symbol copies, but I have not been able to put them to successful use.
Answer 1:
What you are trying to do is possible, but you have made a few mistakes in the way you are declaring and defining the structures that will hold and use the function pointer.
You wrote:
    This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure.
    error: attribute "device" does not apply here
This is only because you are attempting to assign a memory space to a structure or class data member, which is illegal in CUDA. The memory space of all class or structure data members is set implicitly when you define or instantiate the class. So something only slightly different (and more concrete):
#include <cstdio>  // printf, used by the host code

// Pointer to a function taking two coefficients and a float4.
typedef float (* fp)(float, float, float4);

struct functor
{
    float c0, c1;
    fp f;  // no memory-space qualifier on the member

    __device__ __host__
    functor(float _c0, float _c1, fp _f) : c0(_c0), c1(_c1), f(_f) {}

    __device__ __host__
    float operator()(float4 x) { return f(c0, c1, x); }
};

__global__
void kernel(float c0, float c1, fp f, const float4 * x, float * y, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    struct functor op(c0, c1, f);

    // grid-stride loop over the input
    for(int i = tid; i < N; i += blockDim.x * gridDim.x) {
        y[i] = op(x[i]);
    }
}
is perfectly valid. The function pointer member f (of type fp) in functor implicitly points to a __device__ function when an instance of functor is instantiated in device code. If it were instantiated in host code, the function pointer would implicitly point to a host function. In the kernel, a device function pointer passed as an argument is used to instantiate a functor instance. All perfectly legal.
I believe I am correct in saying that there is no direct way to get the address of a __device__ function in host code, so you still require some static declarations and symbol manipulation. This might be different in CUDA 5, but I have not tested it to see. If we flesh out the device code above with a couple of __device__ functions and some supporting host code:
__device__ __host__
float f1 (float a, float b, float4 c)
{
    return a + (b * c.x) + (b * c.y) + (b * c.z) + (b * c.w);
}

__device__ __host__
float f2 (float a, float b, float4 c)
{
    return a + b + c.x + c.y + c.z + c.w;
}

// Static table of device function addresses, stored in constant memory.
__constant__ fp function_table[] = {f1, f2};

int main(void)
{
    const float c1 = 1.0f, c2 = 2.0f;
    const int n = 20;
    float4 vin[n];
    float vout1[n], vout2[n];

    // Fill the input with 0, 1, 2, ...
    for(int i=0, j=0; i<n; i++) {
        vin[i].x = j++; vin[i].y = j++;
        vin[i].z = j++; vin[i].w = j++;
    }

    float4 * _vin;
    float * _vout1, * _vout2;
    size_t sz4 = sizeof(float4) * size_t(n);
    size_t sz1 = sizeof(float) * size_t(n);
    cudaMalloc((void **)&_vin, sz4);
    cudaMalloc((void **)&_vout1, sz1);
    cudaMalloc((void **)&_vout2, sz1);
    cudaMemcpy(_vin, &vin[0], sz4, cudaMemcpyHostToDevice);

    // Copy the device function pointers to the host. The symbol is passed
    // directly; string symbol names were removed in CUDA 5.0.
    fp funcs[2];
    cudaMemcpyFromSymbol(&funcs, function_table, 2 * sizeof(fp));

    // Run the kernel once with each device function.
    kernel<<<1,32>>>(c1, c2, funcs[0], _vin, _vout1, n);
    cudaMemcpy(&vout1[0], _vout1, sz1, cudaMemcpyDeviceToHost);
    kernel<<<1,32>>>(c1, c2, funcs[1], _vin, _vout2, n);
    cudaMemcpy(&vout2[0], _vout2, sz1, cudaMemcpyDeviceToHost);

    // The same functor, instantiated on the host with host function addresses.
    struct functor func1(c1, c2, f1), func2(c1, c2, f2);
    for(int i=0; i<n; i++) {
        printf("%2d %6.f %6.f (%6.f,%6.f,%6.f,%6.f ) %6.f %6.f %6.f %6.f\n",
               i, c1, c2, vin[i].x, vin[i].y, vin[i].z, vin[i].w,
               vout1[i], func1(vin[i]), vout2[i], func2(vin[i]));
    }

    cudaFree(_vin);
    cudaFree(_vout1);
    cudaFree(_vout2);
    return 0;
}
you get a fully compilable and runnable example. Here two __device__ functions and a static function table provide a mechanism for the host code to retrieve __device__ function pointers at runtime. The kernel is called once with each __device__ function and the results displayed, along with the exact same functor and functions instantiated and called from host code (and thus running on the host) for comparison:
$ nvcc -arch=sm_30 -Xptxas="-v" -o function_pointer function_pointer.cu
ptxas info : Compiling entry function '_Z6kernelffPFfff6float4EPKS_Pfi' for 'sm_30'
ptxas info : Function properties for _Z6kernelffPFfff6float4EPKS_Pfi
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z2f1ff6float4
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _Z2f2ff6float4
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 16 registers, 356 bytes cmem[0], 16 bytes cmem[3]
$ ./function_pointer
0 1 2 ( 0, 1, 2, 3 ) 13 13 9 9
1 1 2 ( 4, 5, 6, 7 ) 45 45 25 25
2 1 2 ( 8, 9, 10, 11 ) 77 77 41 41
3 1 2 ( 12, 13, 14, 15 ) 109 109 57 57
4 1 2 ( 16, 17, 18, 19 ) 141 141 73 73
5 1 2 ( 20, 21, 22, 23 ) 173 173 89 89
6 1 2 ( 24, 25, 26, 27 ) 205 205 105 105
7 1 2 ( 28, 29, 30, 31 ) 237 237 121 121
8 1 2 ( 32, 33, 34, 35 ) 269 269 137 137
9 1 2 ( 36, 37, 38, 39 ) 301 301 153 153
10 1 2 ( 40, 41, 42, 43 ) 333 333 169 169
11 1 2 ( 44, 45, 46, 47 ) 365 365 185 185
12 1 2 ( 48, 49, 50, 51 ) 397 397 201 201
13 1 2 ( 52, 53, 54, 55 ) 429 429 217 217
14 1 2 ( 56, 57, 58, 59 ) 461 461 233 233
15 1 2 ( 60, 61, 62, 63 ) 493 493 249 249
16 1 2 ( 64, 65, 66, 67 ) 525 525 265 265
17 1 2 ( 68, 69, 70, 71 ) 557 557 281 281
18 1 2 ( 72, 73, 74, 75 ) 589 589 297 297
19 1 2 ( 76, 77, 78, 79 ) 621 621 313 313
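As a variation on the function table used above, the table can also be indexed from device code, so that only an integer selector crosses the host/device boundary and no pointer has to be copied back to the host. This is a minimal sketch, not part of the program above: kernel_by_index, run_by_index and the which parameter are hypothetical names introduced for illustration, and the inputs match the first row of the output (c0 = 1, c1 = 2, x = (0, 1, 2, 3)).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef float (* fp)(float, float, float4);

__device__ __host__
float f1 (float a, float b, float4 c)
{
    return a + (b * c.x) + (b * c.y) + (b * c.z) + (b * c.w);
}

__device__ __host__
float f2 (float a, float b, float4 c)
{
    return a + b + c.x + c.y + c.z + c.w;
}

// Same static table as in the answer, holding device function addresses.
__constant__ fp function_table[] = {f1, f2};

// Hypothetical kernel: selects the function by index on the device.
__global__
void kernel_by_index(int which, float c0, float c1, float4 x, float * y)
{
    *y = function_table[which](c0, c1, x);
}

// Hypothetical host helper: runs one call and returns the result.
float run_by_index(int which)
{
    float * d_y = nullptr;
    float h_y = -1.0f;
    cudaMalloc(&d_y, sizeof(float));
    kernel_by_index<<<1,1>>>(which, 1.0f, 2.0f,
                             make_float4(0.0f, 1.0f, 2.0f, 3.0f), d_y);
    cudaMemcpy(&h_y, d_y, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_y);
    return h_y;
}

int main(void)
{
    printf("%.0f %.0f\n", run_by_index(0), run_by_index(1));
    return 0;
}
```

The trade-off is that the set of callable functions is fixed when the table is compiled, which may or may not suit an API where users supply their own functions.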
If I have understood your question correctly, the above example should give you pretty much all the design patterns you need to implement your ideas in device code.
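For the API scenario in the question, one possible way to fill a function pointer member without any static pointer or symbol copy is to do the assignment in a small kernel, since taking the address of a __device__ function is legal in device code. The following is only a sketch under assumptions: the structure is reduced to its pointer member, the body of my_function is made up, and init_structure, use_structure and run_demo are hypothetical names. It also assumes the structure itself lives in device memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef float (*fn_t)(float, float, float[]);

// Reduced stand-in for the question's structure (assumption: other members omitted).
struct structure
{
    fn_t function_pointer;  // no __device__ qualifier on the member
};

// Hypothetical device function with the question's signature.
__device__ float
my_function (float a, float b, float c[])
{
    return a + b * c[0];
}

// Assign on the device, where the address of a __device__ function is available.
__global__ void init_structure(structure *st)
{
    st->function_pointer = my_function;
}

// Call through the stored pointer.
__global__ void use_structure(const structure *st, float *out)
{
    float c[1] = {3.0f};
    *out = st->function_pointer(1.0f, 2.0f, c);
}

// Hypothetical host helper: allocate, initialize, call, and fetch the result.
float run_demo()
{
    structure *d_st = nullptr;
    float *d_out = nullptr;
    float h_out = -1.0f;
    cudaMalloc(&d_st, sizeof(structure));
    cudaMalloc(&d_out, sizeof(float));
    init_structure<<<1,1>>>(d_st);
    use_structure<<<1,1>>>(d_st, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_st);
    cudaFree(d_out);
    return h_out;
}

int main(void)
{
    printf("%f\n", run_demo());  // 1 + 2 * 3
    return 0;
}
```

Each user-defined structure would need its own initialization kernel (or a templated one), but nothing has to be declared statically outside the structure.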
Source: https://stackoverflow.com/questions/11857045/cuda-device-function-pointers-in-structure-without-static-pointers-or-symbol-cop