What is the best way to programmatically choose the best GPU in OpenCL?


Question


On my laptop I have two graphics cards: an Intel Iris and an Nvidia GeForce GT 750M. I am trying to do a simple vector add using OpenCL. I know that the Nvidia card is much faster and can do the job better. In principle, I could put an if statement in the code that looks for "NVIDIA" in the VENDOR attribute, but I'd like something more elegant. What is the best way to choose the better (faster) GPU programmatically in OpenCL C/C++?


Answer 1:


I developed a real-time ray tracer (not just a ray caster) which programmatically chose two GPUs and a CPU and rendered and balanced the load on all three in real time. Here is how I did it.

Let's say there are three devices, d1, d2, and d3. Assign each device a weight: w1, w2, and w3. Call the number of pixels to be rendered n. Assume a free parameter called alpha.

  1. Assign each device a weight of 1/3.
  2. Let alpha = 0.5.
  3. Render the first n1=w1*n pixels on d1, the next n2=w2*n pixels on d2, and the last n3=w3*n pixels on d3, and record the render times t1, t2, and t3 for each device.
  4. Calculate a value vsum = n1/t1 + n2/t2 + n3/t3.
  5. Recalculate the weights w_i = alpha*w_i + (1-alpha)*n_i/t_i/vsum.
  6. Go back to step 3.

The point of the value alpha is to allow a smooth transition. Instead of reassigning all the weight based on the latest times, it mixes in some of the old weight. Without alpha I got instabilities. The value of alpha can be tuned; in practice it can probably be set around 1%, but not 0%.
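As a minimal sketch, steps 3-5 can be written roughly as below. The renderSlice() function is a placeholder (not part of the original renderer): it must render the given pixel range on one device and return the elapsed time in seconds.

    // Rough sketch of the weight-update loop for k devices.
    // renderSlice() is a placeholder: it renders `pixelCount` pixels starting
    // at `firstPixel` on device `device` and returns the time in seconds.
    #include <cstddef>
    #include <vector>

    double renderSlice(std::size_t device, std::size_t firstPixel, std::size_t pixelCount);

    void balance(std::vector<double>& w, std::size_t n, double alpha, int frames)
    {
        const std::size_t k = w.size();
        std::vector<std::size_t> ni(k);
        std::vector<double> ti(k);

        for (int frame = 0; frame < frames; ++frame) {
            // Step 3: split the n pixels according to the current weights.
            std::size_t offset = 0;
            for (std::size_t i = 0; i < k; ++i) {
                ni[i] = (i + 1 < k) ? static_cast<std::size_t>(w[i] * n)
                                    : n - offset;   // last device takes the remainder
                ti[i] = renderSlice(i, offset, ni[i]);
                offset += ni[i];
            }
            // Step 4: vsum is the sum of the per-device "velocities" n_i/t_i.
            double vsum = 0.0;
            for (std::size_t i = 0; i < k; ++i)
                vsum += ni[i] / ti[i];
            // Step 5: blend the old weight with the newly measured share.
            for (std::size_t i = 0; i < k; ++i)
                w[i] = alpha * w[i] + (1.0 - alpha) * (ni[i] / ti[i]) / vsum;
        }
    }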


Let's choose an example.

I had a GTX 590 which was a dual GPU card with two under-clocked GTX580s. I also had a Sandy Bridge 2600K processor. The GPUs were much faster than the CPU. Let's assume they were about 10 times faster. Let's also say there were 900 pixels.

Render the first 300 pixels with GPU1, the next 300 pixels with GPU2, and the last 300 pixels with CPU1, and record times of 10 s, 10 s, and 100 s respectively. So one GPU alone would take 30 s for the whole image and the CPU alone would take 300 s. Both GPUs together would take 15 s.

Calculate vsum = 30 + 30 + 3 = 63. Recalculate the weights: w1 = w2 = 0.5*(1/3) + 0.5*(300/10)/63 ≈ 0.4 and w3 = 0.5*(1/3) + 0.5*(300/100)/63 ≈ 0.2.

Render the next frame: 360 pixels with GPU1, 360 pixels with GPU2, and 180 pixels with CPU1, and the times become a bit more balanced, say 11 s, 11 s, and 55 s.

After a number of frames the (1-alpha) term dominates and eventually the weights are based entirely on that term. In this case the weights become about 47% (427 pixels), 47%, and 6% (46 pixels), and the times become, say, 14 s, 14 s, and 14 s. In this case the CPU only improves on the GPU-only result by about one second.

I assumed a uniform load in this calculation. In a real ray tracer the load varies per scan line and per pixel, but the algorithm for determining the weights stays the same.

In practice, once the weights are found they don't change much unless the load of the scene changes significantly, e.g. if one region of the scene has high refraction and reflection and the rest is diffuse. Even in that case I limit the tree depth, so this does not have a dramatic effect.

It's easy to extend this method to multiple devices with a loop. I once tested my ray tracer on four devices: two 12-core Xeon CPUs and two GPUs. In this case the CPUs had a lot more influence, but the GPUs still dominated.


In case anyone is wondering: I created a context for each device and used each context in a separate thread (using pthreads). For three devices I used three threads.
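A rough outline of that setup, with error handling omitted; threadMain() is a placeholder for the per-device render loop, and the fixed-size arrays assume at most 16 devices:

    // One OpenCL context, command queue and pthread per device.
    #include <CL/cl.h>
    #include <pthread.h>

    struct DeviceWorker {
        cl_device_id     device;
        cl_context       context;
        cl_command_queue queue;
    };

    // Placeholder for the per-device work: build the program, set kernel
    // arguments and enqueue this device's slice of the frame.
    static void* threadMain(void* arg)
    {
        DeviceWorker* w = static_cast<DeviceWorker*>(arg);
        (void)w;
        return nullptr;
    }

    void launchWorkers(cl_device_id* devices, int count)   // count <= 16 assumed
    {
        pthread_t    threads[16];
        DeviceWorker workers[16];
        for (int i = 0; i < count; ++i) {
            workers[i].device  = devices[i];
            workers[i].context = clCreateContext(nullptr, 1, &devices[i],
                                                 nullptr, nullptr, nullptr);
            workers[i].queue   = clCreateCommandQueue(workers[i].context,
                                                      devices[i], 0, nullptr);
            pthread_create(&threads[i], nullptr, threadMain, &workers[i]);
        }
        for (int i = 0; i < count; ++i)
            pthread_join(threads[i], nullptr);
    }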

In fact you can use this to run the same device through drivers from different vendors. For example, I used both the AMD and Intel CPU drivers simultaneously (each generating about half the frame) on my 2600K to see which vendor was better. When I first did this (2012), if I recall correctly, AMD beat Intel, ironically, on an Intel CPU.


In case anyone is interested in how I came up with the formula for the weights: I used an idea from physics (my background is physics, not programming).

Speed (v) = distance/time. In this case distance (d) is the number of pixels to process. The total distance then is

d = v1*t1 + v2*t2 + v3*t3

and we want them to each finish in the same time so

d = (v1 + v2 + v3)*t

then to get the weight define

v_i*t = w_i*d

which gives

w_i = v_i*t/d

and replacing (t/d) from (d = (v1 + v2 + v3)*t) gives:

w_i = v_i /(v1 + v2 + v3)

It's easy to see this can be generalized to any number of devices k

w_i = v_i/(v1 + v2 + ... + v_k)

So vsum in my algorithm stands for the "sum of the velocities". Lastly, since v_i is pixels over time, it is n_i/t_i, which finally gives

w_i = n_i/t_i/(n1/t1 + n2/t2 + ... + n_k/t_k)

which is the second term in my formula to calculate the weights.




Answer 2:


If it is simply a vector add and your app resides on the host side, then the CPU will win. Or, even better, the integrated GPU will be much faster. Overall performance depends on the algorithm, the OpenCL buffer types (USE_HOST_PTR, READ_WRITE, etc.) and the compute-to-data ratio. Even if you don't copy but instead pin the array and access it in place, the CPU's latency will be smaller than the PCI-e latency.
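As an illustration of the buffer-type point, a USE_HOST_PTR (zero-copy style) buffer can be created roughly as below; whether the driver actually avoids the copy depends on the array's alignment and on the implementation, and the helper name is made up:

    // Sketch: let the runtime work on the application's own array instead
    // of copying it, which mainly helps CPU and integrated-GPU devices.
    #include <CL/cl.h>
    #include <cstddef>

    cl_mem makeZeroCopyBuffer(cl_context ctx, float* hostArray, std::size_t n)
    {
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    n * sizeof(float),
                                    hostArray,   // must outlive the buffer; ideally page-aligned
                                    &err);
        return (err == CL_SUCCESS) ? buf : nullptr;
    }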

If you are going to use OpenGL + OpenCL interop, then you will need to know whether your compute device is the same device as your rendering output device (if your screen gets its data from the iGPU then it is the Iris; if not, it is the Nvidia).

If you just need to do some operations on C++ arrays (host side) and get the results the fastest way, then I suggest load balancing.

Example: a vector add of 4k elements on a Core i7-5775C with an Iris Pro and two GT 750Ms (one overclocked by 10%).

First, give an equal number of ndrange ranges to all devices. At the end of each calculation phase, check the timings.

CPU      iGPU        dGPU-1        dGPU-2 oc
Intel    Intel       Nvidia        Nvidia  
1024     1024        1024          1024  
34 ms    5ms         10ms          9ms    

Then calculate weighted (based on the last ndrange range) but relaxed (not exact, just close) approximations of each device's calculation bandwidth and change the ndrange ranges accordingly:

Intel    Intel       Nvidia        Nvidia 
512      1536        1024          1024  
16 ms    8ms         10ms          9ms    

Then continue calculating until it really becomes stable,

Intel    Intel       Nvidia        Nvidia 
256      1792        1024          1024  
9ms      10ms         10ms         9ms 

or until you can enable a finer granularity:

Intel    Intel       Nvidia        Nvidia 
320      1728        1024          1024  
10ms     10ms        10ms          9ms 

Intel    Intel       Nvidia        Nvidia  
320      1728        960           1088  
10ms     10ms        10ms          10ms 

         ^            ^
         |            |
         |            PCI-E bandwidth not more than 16 GB/s per device
        closer to RAM, better bandwidth (20-40 GB/s) and less kernel overhead

Instead of using just the latest iteration for balancing, you can take the average (or a PID controller) of the last 10 results to eliminate spikes that would mislead the balancing. Also, buffer copies can take more time than the computation; if you include this in the balancing, you can shut down unnecessary / non-benefiting devices.
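For reference, issuing such a split is roughly a matter of enqueueing the same 1-D kernel on each device's queue with a different global offset and size. The sketch below assumes OpenCL 1.1 or later, that each device already has its own context, queue, built kernel and buffers with arguments set, and that results are read back per slice afterwards; the names are illustrative.

    // Enqueue the same 1-D kernel on each device with its own slice of the range.
    #include <CL/cl.h>
    #include <cstddef>
    #include <vector>

    void enqueueSplit(const std::vector<cl_command_queue>& queues,
                      const std::vector<cl_kernel>&        kernels,  // one per device/context
                      const std::vector<std::size_t>&      counts)   // e.g. {2560, 768, 768}
    {
        std::size_t offset = 0;
        std::vector<cl_event> events(queues.size());
        for (std::size_t i = 0; i < queues.size(); ++i) {
            // get_global_id(0) inside the kernel already includes the offset,
            // so each device writes only its own part of c[].
            clEnqueueNDRangeKernel(queues[i], kernels[i], 1,
                                   &offset, &counts[i], nullptr,
                                   0, nullptr, &events[i]);
            clFlush(queues[i]);               // make sure all devices start working
            offset += counts[i];
        }
        for (std::size_t i = 0; i < queues.size(); ++i)
            clWaitForEvents(1, &events[i]);   // time each slice to drive the balancing
    }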

If you make a library, then you won't have to write a benchmark for every new project. The work will be auto-balanced between devices whether you accelerate matrix multiplications, fluid simulations, SQL table joins or financial approximations.

For the solution of balancing:

If you can solve a linear system with n unknowns (the loads per device) and n equations (the benchmark results of all devices), you can find the target loads in a single step. If you choose the iterative approach, you need more steps until it converges. The latter is not harder than writing a benchmark. The former is harder for me, but it should be more efficient over time.
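Here is a sketch of the single-step variant (names are illustrative): redistribute the total element count in proportion to each device's measured throughput from the last run.

    // Single-step rebalance: share the total work in proportion to the
    // measured throughput (elements per millisecond) of each device.
    #include <cstddef>
    #include <vector>

    std::vector<std::size_t> rebalance(const std::vector<std::size_t>& counts,  // elements per device, last run
                                       const std::vector<double>&      timesMs, // measured times, last run
                                       std::size_t                     total)   // total elements to distribute
    {
        const std::size_t k = counts.size();
        std::vector<double> throughput(k);
        double sum = 0.0;
        for (std::size_t i = 0; i < k; ++i) {
            throughput[i] = counts[i] / timesMs[i];
            sum += throughput[i];
        }
        std::vector<std::size_t> next(k);
        std::size_t assigned = 0;
        for (std::size_t i = 0; i + 1 < k; ++i) {
            next[i] = static_cast<std::size_t>(total * throughput[i] / sum);
            assigned += next[i];
        }
        next[k - 1] = total - assigned;   // last device absorbs the rounding remainder
        return next;
    }

The tables above use the relaxed, iterative version instead, moving only part of the way toward this target each round; feeding it an average of the last few timings, as suggested above, damps the spikes further.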

Although a vector-add-only kernel is not a real-world scenario, here is a real benchmark from my system:

    __kernel void bench(__global float *a, __global float *b, __global float *c)
    {
        int i = get_global_id(0);
        c[i] = a[i] + b[i];
    }
AMD FX-8150 (8 cores)    Oland    Pitcairn
2560                     768      768

This is after several iterations (the FX is fastest even with the extra buffer copies, not using any host pointer). Even the Oland GPU is catching up to the Pitcairn because their PCI-e bandwidths are the same.

Now with some trigonometric functions:

    __kernel void bench(__global float *a, __global float *b, __global float *c)
    {
        int i = get_global_id(0);
        c[i] = sin(a[i]) + cos(b[i]) + sin(cos((float)i));
    }

AMD FX-8150    Oland    Pitcairn
1792           1024     1280

Testing GDDR3 128-bit vs. GDDR5 256-bit (overclocked) and caching:

    __kernel void bench(__global float *a, __global float *b, __global float *c)
    {
        int i = get_global_id(0);

        c[i] = a[i]+b[i]-a[i]-b[i]+a[i]+b[i]-b[i]-a[i]+b[i];
        for(int j = 0; j < 12000; j++)
            c[i] += a[i]+b[i]-a[i]-b[i]+a[i]+b[i]-b[i]-a[i]+b[i];
    }



AMD FX-8150    Oland    Pitcairn
256            256      3584

High compute-to-data ratio:

    __kernel void bench(__global float *a, __global float *b, __global float *c)
    {
        int i = get_global_id(0);

        c[i] = 0.0f; float c0 = c[i]; float a0 = a[i]; float b0 = b[i];
        for(int j = 0; j < 12000; j++)
            c0 += sin(a0) + cos(b0*a0) + cos(sin(b0)*19.95f);
        c[i] = c0;
    }

AMD FX-8150    Oland    Pitcairn
256            2048     1792

Now the Oland GPU is worthy again and wins even with just 320 cores, because the 4k elements easily wrap around all 320 cores more than 10 times, while the Pitcairn GPU (1280 cores) was not filled with enough wavefronts, which led to lower occupation of its execution units and meant it could not hide latencies. Low-end devices are better for low loads, I think. Maybe I could use this when DirectX 12 comes out with some load balancer, so this Oland could compute the physics of 5000-10000 particles from in-game explosions while the Pitcairn computes smoke densities.




Answer 3:


Good: Just pick the first compatible device. On most systems, there is only one.

Better: You can very roughly estimate device performance by multiplying the CL_DEVICE_MAX_COMPUTE_UNITS device info result by the CL_DEVICE_MAX_CLOCK_FREQUENCY device info result. Depending on your workload, you might want to include other metrics, such as memory size, and blend them accordingly.
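As a sketch of that estimate (the function name and the fixed array sizes are arbitrary), score every GPU on every platform as compute units times clock frequency and keep the best:

    // Roughly score every GPU on every platform as compute_units * clock_MHz.
    #include <CL/cl.h>

    cl_device_id pickBestGpu()
    {
        cl_platform_id platforms[8];
        cl_uint numPlatforms = 0;
        clGetPlatformIDs(8, platforms, &numPlatforms);
        if (numPlatforms > 8) numPlatforms = 8;

        cl_device_id best = nullptr;
        cl_ulong bestScore = 0;

        for (cl_uint p = 0; p < numPlatforms; ++p) {
            cl_device_id devices[16];
            cl_uint numDevices = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 16,
                               devices, &numDevices) != CL_SUCCESS)
                continue;                          // platform has no GPU
            if (numDevices > 16) numDevices = 16;

            for (cl_uint i = 0; i < numDevices; ++i) {
                cl_uint units = 0, mhz = 0;
                clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(units), &units, nullptr);
                clGetDeviceInfo(devices[i], CL_DEVICE_MAX_CLOCK_FREQUENCY,
                                sizeof(mhz), &mhz, nullptr);
                cl_ulong score = (cl_ulong)units * mhz;
                if (score > bestScore) { bestScore = score; best = devices[i]; }
            }
        }
        return best;   // nullptr if no GPU was found
    }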

Best: Benchmark with your exact workflow on each device. It's really the only way to know for sure, since anything else is just a guess.
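A bare-bones version of that idea; runKernelOnce() is a placeholder that should build and run your real kernel on one device and block (e.g. with clFinish) before returning:

    // Pick whichever device runs the actual workload fastest (wall-clock timing).
    #include <CL/cl.h>
    #include <chrono>
    #include <vector>

    void runKernelOnce(cl_device_id dev);   // placeholder: run your real kernel, block until done

    cl_device_id benchmarkAndPick(const std::vector<cl_device_id>& devices)
    {
        cl_device_id best = nullptr;
        double bestSeconds = 1e300;
        for (cl_device_id d : devices) {
            auto t0 = std::chrono::steady_clock::now();
            runKernelOnce(d);
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            if (s < bestSeconds) { bestSeconds = s; best = d; }
        }
        return best;
    }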

Finally, the user might care about which of their GPUs you are using, so you should have some way to override your programmatic choice regardless of which method you choose.



Source: https://stackoverflow.com/questions/33333468/what-is-the-best-way-to-programmatically-choose-the-best-gpu-in-opencl
