As the title says, I have tried capturing GPU data at runtime using the following tools, but none of them offers a unified standard for assessing the performance of cs.