Does OpenCL support a randomly accessed global queue buffer?

Question

I am writing a kernel that processes combinatorial data. Because these sorts of problems generally have a large problem space in which most of the processed data is junk, is there a way I could do the following?

(1) If the calculated data passes some sort of condition, it is put onto a global output buffer.

(2) Once the output buffer is full, the data is sent back to the host.

(3) The host takes a copy of the data from the buffer and clears it.

(4) The host then creates a new buffer to be filled by the GPU.

For simplicity, this example could be stated as a selective inner product, by which I mean:

__global int buffer_counter; // Counts the values placed on the output buffer

void put_onto_output_buffer(float value, __global float *buffer, int size)
{
    // Put this value onto the global buffer or send a signal to the host
}

__kernel void
inner_product(
    __global const float *threshold,       // threshold
    __global const float *first_vector,    // 10000 float vector
    __global const float *second_vector,   // 10000 float vector
    __global float *output_buffer,         // 100 float vector
    __global const int *output_buffer_size) // size of the output buffer -- 100
{
    int id = get_global_id(0);
    float value = first_vector[id] * second_vector[id];
    if (value >= threshold[0])
        put_onto_output_buffer(value, output_buffer, output_buffer_size[0]); 
}

Answer 1:

It depends on how often output is produced. If output is frequent (a work item writes output more often than not), then buffer_counter will be a source of contention and will cause slowdowns (it has to be updated using atomic operations, which is why it is slow). In this case you're better off always writing output and sorting out the real results afterwards.
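As a rough illustration of the "always write, filter later" option, here is a host-side C model rather than a real kernel. The names dense_item, filter_results and the keep flag array are hypothetical and do not come from the question. Each call to dense_item stands in for one work item, which writes only to its own slot and therefore needs no shared counter.

```c
#include <assert.h>
#include <stddef.h>

/* Models one work item: it always writes slot `id`, so there is no
   shared counter and no contention. keep[id] records whether the value
   passed the threshold (the flag array is an assumption of this sketch). */
static void dense_item(size_t id, float threshold,
                       const float *a, const float *b,
                       float *out, unsigned char *keep)
{
    assert(a != NULL && b != NULL && out != NULL && keep != NULL);
    float value = a[id] * b[id];
    out[id] = value;
    keep[id] = (value >= threshold) ? 1 : 0;
}

/* Host-side pass run after the kernel has finished: copies the flagged
   values into `kept`, preserving work-item order, and returns how many
   values were kept. */
static size_t filter_results(const float *out, const unsigned char *keep,
                             size_t n, float *kept)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (keep[i])
            kept[count++] = out[i];
    }
    return count;
}
```

The cost of this approach is an output buffer as large as the input and a full pass over it on the host, which is why it only pays off when most work items produce output.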

On the other hand, if writing output is fairly infrequent, then using an atomic position indicator makes good sense. The majority of work items will do their computation, decide they have no output, and retire. Only the infrequent ones that have output will contend over the atomic output position index, serially increment it, and write their output at their unique location. Your output memory will compactly contain the results (in no particular order, so store the work item ID if you care).
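The atomic position index can be sketched as follows, again as a host-side C11 model rather than a real kernel. In OpenCL C 1.1 and later, the atomic_fetch_add call below would correspond to atomic_inc on a volatile __global int pointer, which also returns the old value. The names append_item and stored_count are hypothetical. The overflow handling is an assumption of this sketch: a work item that finds the buffer full simply drops its value, and the host compares the counter with the capacity after the kernel finishes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Models one work item. `counter` plays the role of buffer_counter and
   must be reset to zero by the host before each launch. Returns 1 if the
   value was stored, 0 if it was rejected or the buffer was already full. */
static int append_item(int id, float threshold,
                       const float *a, const float *b,
                       float *out_values, int *out_ids,
                       int capacity, atomic_int *counter)
{
    assert(out_values != NULL && out_ids != NULL && counter != NULL);
    float value = a[id] * b[id];
    if (!(value >= threshold))
        return 0;                 /* no output: no atomic traffic at all */

    /* Reserve a unique slot; the old counter value is this item's index. */
    int slot = atomic_fetch_add(counter, 1);
    if (slot >= capacity)
        return 0;                 /* full: the counter still records the overflow */

    out_values[slot] = value;
    out_ids[slot] = id;           /* results arrive unordered, so keep the ID */
    return 1;
}

/* Host side, after the launch: number of valid entries in the buffer.
   A counter larger than `capacity` means some results were dropped. */
static int stored_count(atomic_int *counter, int capacity)
{
    int total = atomic_load(counter);
    return total < capacity ? total : capacity;
}
```

If the counter ends up larger than the capacity, the host knows results were lost and can rerun with a larger buffer or over a smaller range of work items.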

Again, do read up on atomics because the index needs to be atomic.

Source: https://stackoverflow.com/questions/22246965/does-opencl-support-a-randomly-accessed-global-queue-buffer

Tags

kernel

opencl