I’m trying to implement a runs-based kernel using Numba where I need to traverse the elements of a 3D matrix on a row-per-thread basis, i.e., each thread is assigned a row tha