I am porting my CPU code to the GPU. While optimizing it, I ran into some counterintuitive performance behavior:
Consider the simple task of calculating a vector's L2 no