In CLRS page 199 they state:
Lemma 8.4: Given n b-bit numbers and any positive integer r <= b, RADIX-SORT correctly sorts these numbers in O((b/r)(n +
Partial answer here. First question: the base would be 256, so a 32-bit integer has four 8-bit digits. What the article leaves out is that it takes one read pass over the data to create a matrix of counts, which is then converted into a matrix of starting indices (or pointers). In this case the matrix is [4][256]. After creating the matrix, it takes 4 read/write radix sort passes to sort the dataset.
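A minimal sketch of that scheme, assuming unsigned 32-bit keys held in a Python list. The function name radix_sort_u32 and the variable names are made up for illustration, and this is not tuned for speed:

```python
def radix_sort_u32(data):
    """Sort a list of unsigned 32-bit integers, least significant byte first."""
    # One read pass: count how often each byte value occurs
    # at each of the 4 digit positions. This is the [4][256] matrix.
    counts = [[0] * 256 for _ in range(4)]
    for x in data:
        for d in range(4):
            counts[d][(x >> (8 * d)) & 0xFF] += 1

    # Convert each row of counts into starting indices (exclusive prefix sums).
    for row in counts:
        total = 0
        for v in range(256):
            row[v], total = total, total + row[v]

    # Four read/write passes, one per 8-bit digit.
    src, dst = list(data), [0] * len(data)
    for d in range(4):
        index = counts[d]
        shift = 8 * d
        for x in src:
            v = (x >> shift) & 0xFF
            dst[index[v]] = x   # place x at the next free slot for its digit
            index[v] += 1
        src, dst = dst, src     # output of this pass feeds the next one
    return src                  # after an even number of swaps, src holds the result
```

Counting all four digits in a single pass works because the number of keys with a given digit value does not depend on the order the keys are in, so the counts stay valid for the later passes.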
Second question: for a math-based explanation, the derivative of (b/r)(n + 2^r) with respect to r is b (2^r (r log(2) - 1) - n) / r^2, where log is the natural logarithm. A minimum (or maximum) occurs where the derivative is 0, that is, where 2^r (r log(2) - 1) - n = 0. For n = 2^20 (about 1 million) and b = 32, r ~= 16.606232 results in O() ~= 2212837. Some example values of r and O():
r O
18 2330169
17 2220514
16 2228224
15 2306867
12 2807125
8 4195328
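The figures above can be checked with a short script. This is a sketch assuming b = 32 and n = 2^20, with the cost taken to be exactly (b/r)(n + 2^r), fractional b/r included. The names cost and optimal_r are made up for illustration, and the root is found by plain bisection:

```python
import math

def cost(r, n=2**20, b=32):
    """Cost expression (b/r)(n + 2^r) for digit width r."""
    return (b / r) * (n + 2.0 ** r)

def optimal_r(n=2**20, lo=1.0, hi=32.0, iterations=100):
    """Solve 2^r (r ln 2 - 1) - n = 0 by bisection.

    The left-hand side is increasing in r for r > 0, so there is one root.
    """
    def g(r):
        return 2.0 ** r * (r * math.log(2) - 1) - n
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    r_opt = optimal_r()
    print("optimal r:", r_opt, "cost:", cost(r_opt))
    for r in (18, 17, 16, 15, 12, 8):
        print(r, round(cost(r)))
```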
However, due to cache issues, the optimal value of r for a given n is smaller in practice. On my system (Intel 2600K, 3.4 GHz), for n = 2^20, r = 8 is fastest. At around n = 2^24, an average of r = 10.67 (three fields of 10, 11, and 11 bits) is fastest. At around n = 2^26, r = 16 is fastest. Again due to cache issues, there's not a lot of difference in performance: less than 10% for r = 8 versus r = 16.