Edit: The code here still has some bugs in it, and it could do better in the performance department, but instead of trying to fix this, for t
It looks like your implementation assumes that sizeof(size_t) == sizeof(float). Will that always be true for your target platforms?
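To make the size concern concrete: on most 64-bit targets size_t is 8 bytes while float is 4, so the assumption would not hold there. Below is a minimal sketch of one way to state the assumption explicitly and move the bits around without pointer casts. I haven't seen the rest of your code, so float_bits and bits_to_float are hypothetical names, and using std::uint32_t in place of size_t is my assumption about what you want.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float is 32 bits wide on the target. If it is not,
// compilation fails here instead of the code misbehaving at run time.
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "float is assumed to be 32 bits wide");

// Hypothetical helper: copy the object representation of a float into a
// fixed-width integer. memcpy avoids the aliasing problems of casting a
// float* to an integer pointer.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: the inverse of float_bits.
inline float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

A fixed-width type keeps the size the same on 32-bit and 64-bit builds, and the static_assert turns a silent assumption into a compile error. Compilers generally turn a small fixed-size memcpy like this into a plain register move, so it shouldn't cost you anything.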
And I wouldn't say threading heresy so much as casting heresy. :)