I am trying to implement the cyclic reduction algorithm described in the paper "Fast Tridiagonal Solvers on the GPU" (https://www.jcohen.name/papers/Zhang_Fast_2009.pdf).