How to calculate Gflops of a kernel

轮回少年 2020-11-30 04:05

I want a measure of how much of the peak performance my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a peak GFLOPS of 622.08 (~= 240Cores * 1300MHz * 2
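The peak figure quoted in the question follows the pattern cores * clock * 2. A minimal sketch of that arithmetic is shown here; the function name `peak_gflops` and the `flops_per_cycle` parameter are hypothetical, and reading the factor of 2 as "one multiply-add per core per cycle" is an assumption. With the rounded 1300 MHz the formula gives 624 GFLOPS; 622.08 is what the same formula yields for a clock of 1296 MHz.

```python
def peak_gflops(cores: int, clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOPS: cores * clock (MHz) * operations per core per cycle.

    flops_per_cycle=2 is an assumption (one multiply-add counted as two operations).
    """
    return cores * clock_mhz * flops_per_cycle / 1000.0


rounded = peak_gflops(240, 1300)  # rounded clock, as written in the question
exact = peak_gflops(240, 1296)    # clock that reproduces the quoted 622.08
print(rounded, exact)
```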

2 Answers
  •  时光取名叫无心
    2020-11-30 04:43

    Nsight VSE (>3.2) and the Visual Profiler (>=5.5) support Achieved FLOPS calculation. To collect the metric, the profilers run the kernel twice (using kernel replay). In the first replay, the number of floating-point instructions executed is collected (taking predication and the active mask into account). In the second replay, the duration is collected.
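The two replays supply the two numbers needed for an achieved rate: an operation count and a duration. A minimal sketch of the division, assuming you have both values in hand; `achieved_gflops` is a hypothetical helper, not a profiler API. The inputs are the operation count and time printed by the matrixMul sample in the nvprof section of this answer.

```python
def achieved_gflops(flop_count: int, duration_s: float) -> float:
    """Achieved rate in GFLOP/s for one kernel launch: operations / seconds / 1e9."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return flop_count / duration_s / 1e9


# 131072000 operations in 21.196 msec, as reported by the matrixMul sample.
rate = achieved_gflops(131_072_000, 21.196e-3)
print(f"{rate:.2f} GFLOP/s")
```

This reproduces the rate the sample prints for itself. Timings taken while a profiler is attached may be inflated, so treat the result as a lower bound.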

    nvprof and Visual Profiler have a hardcoded definition: FMA counts as 2 operations, and all other operations count as 1. The flops_sp_* counters are thread instruction execution counts, whereas flops_sp is their weighted sum, so you can apply your own weighting by combining the individual metrics. However, flops_sp_special covers a number of different instructions, so it cannot be weighted per instruction.
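A minimal sketch of applying your own weighting to the individual counters. The counts are made-up illustration values, and `weighted_flops` and `DEFAULT_WEIGHTS` are hypothetical names; the default weights mirror the hardcoded definition described above (FMA = 2, everything else = 1).

```python
# Hardcoded nvprof / Visual Profiler weighting as described in the text.
DEFAULT_WEIGHTS = {"add": 1, "mul": 1, "fma": 2, "special": 1}


def weighted_flops(counts: dict, weights: dict = DEFAULT_WEIGHTS) -> int:
    """Weighted sum of per-instruction-type execution counts."""
    return sum(weights[kind] * n for kind, n in counts.items())


# Hypothetical counter values, for illustration only.
counts = {"add": 1000, "mul": 500, "fma": 2000, "special": 100}

default_total = weighted_flops(counts)

# Same counters, but special operations weighted more heavily (an arbitrary choice).
custom_weights = dict(DEFAULT_WEIGHTS, special=4)
custom_total = weighted_flops(counts, custom_weights)

print(default_total, custom_total)
```

Because the special counter lumps several instruction types together, any weight chosen for it is an average rather than an exact cost.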

    The Nsight VSE experiment configuration allows the user to define the operations per instruction type.

    Nsight Visual Studio Edition

    Configuring to collect Achieved FLOPS

    1. Execute the menu command Nsight > Start Performance Analysis... to open the Activity Editor
    2. Set Activity Type to Profile CUDA Application
    3. In Experiment Settings set Experiments to Run to Custom
    4. In the Experiment List add Achieved FLOPS
    5. In the middle pane select Achieved FLOPS
    6. In the right pane you can customize the FLOPS per instruction executed. The default weighting is for FMA and RSQ to count as 2. In some cases I have seen RSQ as high as 5.
    7. Run the Analysis Session.

    Nsight VSE Achieved FLOPS Experiment Configuration

    Viewing Achieved FLOPS

    1. In the nvreport open the CUDA Launches report page.
    2. In the CUDA Launches page select a kernel.
    3. In the report correlation pane (bottom left) select Achieved FLOPS

    Nsight VSE Achieved FLOPS Results

    nvprof

    Metrics Available (on a K20)

    nvprof --query-metrics | grep flop
    flops_sp:            Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
    flops_sp_add:        Number of single-precision floating-point add operations executed by non-predicated threads
    flops_sp_mul:        Number of single-precision floating-point multiply operations executed by non-predicated threads
    flops_sp_fma:        Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
    flops_dp:            Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)
    flops_dp_add:        Number of double-precision floating-point add operations executed by non-predicated threads
    flops_dp_mul:        Number of double-precision floating-point multiply operations executed by non-predicated threads
    flops_dp_fma:        Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
    flops_sp_special:    Number of single-precision floating-point special operations executed by non-predicated threads
    flop_sp_efficiency:  Ratio of achieved to peak single-precision floating-point operations
    flop_dp_efficiency:  Ratio of achieved to peak double-precision floating-point operations
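The two efficiency metrics are described as the ratio of achieved to peak operations. A sketch of the same ratio computed by hand, assuming you already know both rates; `flop_efficiency` is a hypothetical helper, the peak is the 622.08 GFLOPS quoted in the question, and the achieved value is a made-up example.

```python
def flop_efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak that was achieved (0.0 to 1.0)."""
    if peak_gflops <= 0:
        raise ValueError("peak must be positive")
    return achieved_gflops / peak_gflops


# 155.52 GFLOP/s achieved is a hypothetical figure; 622.08 is the peak from the question.
eff = flop_efficiency(155.52, 622.08)
print(f"{eff:.1%}")
```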
    

    Collection and Results

    nvprof --devices 0 --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma matrixMul.exe
    [Matrix Multiply Using CUDA] - Starting...
    ==2452== NVPROF is profiling process 2452, command: matrixMul.exe
    GPU Device 0: "Tesla K20c" with compute capability 3.5
    
    MatrixA(320,320), MatrixB(640,320)
    Computing result using CUDA Kernel...
    done
    Performance= 6.18 GFlop/s, Time= 21.196 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
    Checking computed result for correctness: OK
    
    Note: For peak performance, please refer to the matrixMulCUBLAS example.
    ==2452== Profiling application: matrixMul.exe
    ==2452== Profiling result:
    ==2452== Metric result:
    Invocations                               Metric Name                        Metric Description         Min         Max         Avg
    Device "Tesla K20c (0)"
            Kernel: void matrixMulCUDA(float*, float*, float*, int, int)
            301                                  flops_sp                             FLOPS(Single)   131072000   131072000   131072000
            301                              flops_sp_add                         FLOPS(Single Add)           0           0           0
            301                              flops_sp_mul                         FLOPS(Single Mul)           0           0           0
            301                              flops_sp_fma                         FLOPS(Single FMA)    65536000    65536000    65536000
    

    NOTE: flops_sp = flops_sp_add + flops_sp_mul + flops_sp_special + (2 * flops_sp_fma) (approximately)
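The relation in the note can be checked against the metric rows printed above. A rough sketch that parses those rows with a regular expression and verifies the identity; the row pattern is an assumption based only on the output shown here, `parse_metric_rows` is a hypothetical helper, and flops_sp_special is taken as 0 because it was not collected in this run.

```python
import re

# invocations, metric name, description (may contain spaces), min, max, avg
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_metric_rows(text: str) -> dict:
    """Map metric name -> average value for every line shaped like a metric row."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            result[match.group(2)] = int(match.group(6))
    return result


sample = """\
        301      flops_sp          FLOPS(Single)   131072000   131072000   131072000
        301      flops_sp_add      FLOPS(Single Add)       0           0           0
        301      flops_sp_mul      FLOPS(Single Mul)       0           0           0
        301      flops_sp_fma      FLOPS(Single FMA)    65536000    65536000    65536000
"""

metrics = parse_metric_rows(sample)
special = metrics.get("flops_sp_special", 0)  # not collected here; assumed 0
expected = (metrics["flops_sp_add"] + metrics["flops_sp_mul"]
            + special + 2 * metrics["flops_sp_fma"])
print(expected == metrics["flops_sp"])
```

For this kernel every floating-point instruction is an FMA, so the identity holds exactly; the note's "approximately" matters once special operations are involved.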

    Visual Profiler

    The Visual Profiler supports the metrics shown in the nvprof section above.
