> I'm trying to understand a formula for when we should use quicksort. For instance, we have an array with N = 1,000,000 elements. If we will search only once, we sho…
This actually turned into an interesting question for me as I looked into the expected runtime of a quicksort-like algorithm when the expected split at each level is not 50/50.
The first question I wanted to answer was: for random data, what is the average split at each level? It must surely be greater than 50% (for the larger side). Given an array of size `N` of random values, picking the smallest value as the pivot gives a split of (1, N-1), picking the second smallest gives (2, N-2), and so on. I put this in a quick script:
```python
# Average fraction of the array that lands in the larger partition,
# averaged over every possible pivot rank.
n = 10000
split = 0.0
for x in range(n):
    split += max(x, n - x) / n  # fraction on the larger side for this pivot
split /= n
print(split)
```
And got 0.75 as the answer. I'm sure I could show that this is always the exact answer, but I wanted to move on to the harder part.
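(A quick way to see it: in the continuous limit, the pivot's rank is uniform on [0, 1], so the expected larger fraction is the integral of max(x, 1 − x) over [0, 1], which by symmetry is 2 · ∫ from 1/2 to 1 of x dx = 3/4.)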
Now, let's assume that even a 25/75 split follows an `n log n` progression for some unknown logarithm base. That means that `num_comparisons(n) = n * log_b(n)`, and the question is to find `b` via statistical means (since I don't expect that model to be exact at every step). We can do this with a clever application of least-squares fitting after we use a logarithm identity to get:

    C(n) = n * log(n) / log(b)

where now the logarithm can have any base, as long as `log(n)` and `log(b)` use the same base. This is a linear equation just waiting for some data! So I wrote another script that generated an array of `xs` filled with `n * log(n)`, an array of `ys` filled with the measured `C(n)`, and used `numpy` to tell me the slope of that least-squares fit, which I expect to equal `1 / log(b)`.
I ran the script and got `b` inside of `[2.16, 2.3]`, depending on how high I set `n` (I varied `n` from 100 to 100,000,000). The fact that `b` seems to vary with `n` shows that my model isn't exact, but I think that's okay for this example.
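Here's a minimal sketch of what such a script can look like (not the exact one I ran; the quicksort below is a plain Lomuto-partition version instrumented to count comparisons, and the `b` you get out depends on exactly how `C(n)` is measured and which range of `n` you fit over):

```python
import math
import random
import numpy as np

def quicksort_comparisons(values):
    """Quicksort a copy of `values`, returning the number of comparisons made."""
    a = list(values)
    count = 0

    def sort(lo, hi):
        nonlocal count
        while lo < hi:
            pivot = a[hi]                  # Lomuto partition, last element as pivot
            i = lo
            for j in range(lo, hi):
                count += 1
                if a[j] < pivot:
                    a[i], a[j] = a[j], a[i]
                    i += 1
            a[i], a[hi] = a[hi], a[i]
            # Recurse on the smaller half, loop on the larger (bounds the stack).
            if i - lo < hi - i:
                sort(lo, i - 1)
                lo = i + 1
            else:
                sort(i + 1, hi)
                hi = i - 1

    sort(0, len(a) - 1)
    return count

ns = [2 ** k for k in range(7, 18)]            # array sizes 128 .. 131072
xs = np.array([n * math.log(n) for n in ns])   # predictor: n * log(n)
ys = np.array([quicksort_comparisons(random.sample(range(n), n))
               for n in ns], dtype=float)      # response: measured C(n)

# No-intercept least-squares fit ys ~ slope * xs; slope estimates 1 / log(b).
slope = np.linalg.lstsq(xs[:, None], ys, rcond=None)[0][0]
print(math.exp(1.0 / slope))                   # estimated base b
```

Fitting without an intercept matches the model `C(n) = n * log(n) / log(b)` exactly; adding an intercept (or an `n` term) would absorb lower-order costs and shift the estimated `b`, which is part of why the estimate drifts as the range of `n` changes.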
To actually answer your question now: with these assumptions (writing `n` for the size of the array and `N` for the number of searches), we can solve for the cutoff point where `N` average-case linear searches cost the same as one sort plus `N` binary searches:

    N * n/2 = n * log_2.3(n) + N * log_2.3(n)

I'm just assuming that the binary search will have the same logarithm base as the sorting method for a 25/75 split. Isolating `N`, you get:

    N = n * log_2.3(n) / (n/2 - log_2.3(n))

If your number of searches `N` exceeds the quantity on the RHS (where `n` is the size of the array in question), then it will be more efficient to sort once and run binary searches on the sorted array.
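As a concrete illustration with the question's million-element array (a quick sketch; `search_cutoff` is just a hypothetical helper name):

```python
import math

def search_cutoff(n, b=2.3):
    """Number of searches beyond which sort-then-binary-search wins,
    per N = n * log_b(n) / (n/2 - log_b(n))."""
    log_b_n = math.log(n) / math.log(b)  # log_b(n) via the change-of-base identity
    return n * log_b_n / (n / 2 - log_b_n)

print(search_cutoff(1_000_000))  # ~33 searches for n = 1,000,000
```

So for that array, anything beyond roughly 33 searches favors sorting first.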