Fastest gap sequence for shell sort?

Question

According to Marcin Ciura's Optimal (best known) sequence of increments for shell sort algorithm, the best sequence for shellsort is 1, 4, 10, 23, 57, 132, 301, 701..., but how can I generate such a sequence? In Marcin Ciura's paper, he said:

Both Knuth’s and Hibbard’s sequences are relatively bad, because they are defined by simple linear recurrences.

but most algorithm books I found tend to use Knuth’s sequence: k = 3k + 1, because it's easy to generate. What's your way of generating a shellsort sequence?
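For reference, the two "simple linear recurrence" sequences mentioned in the quote can be generated in a few lines. This is only a sketch: the function names `knuth_gaps` and `hibbard_gaps`, and the choice to keep only gaps smaller than `n`, are assumptions made for illustration.

```python
def knuth_gaps(n):
    """Gaps 1, 4, 13, 40, ... (k = 3k + 1) smaller than n, largest first."""
    gaps = []
    k = 1
    while k < n:
        gaps.append(k)
        k = 3 * k + 1
    return gaps[::-1]


def hibbard_gaps(n):
    """Gaps 1, 3, 7, 15, ... (k = 2k + 1) smaller than n, largest first."""
    gaps = []
    k = 1
    while k < n:
        gaps.append(k)
        k = 2 * k + 1
    return gaps[::-1]


print(knuth_gaps(1000))   # [364, 121, 40, 13, 4, 1]
print(hibbard_gaps(100))  # [63, 31, 15, 7, 3, 1]
```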

Answer 1:

If your data set has a definite upper bound in size, then you can hardcode the step sequence. You should probably only worry about generality if your data set is likely to grow without an upper bound.
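A minimal sketch of the hardcoded approach, assuming the data never grows far beyond a few thousand elements. The names `CIURA_GAPS` and `shellsort` are hypothetical; the gap values are the ones quoted in the question.

```python
# Ciura's experimentally found gaps, largest first (hardcoded).
CIURA_GAPS = (701, 301, 132, 57, 23, 10, 4, 1)


def shellsort(items, gaps=CIURA_GAPS):
    """Sort the list in place with gapped insertion sorts and return it."""
    for gap in gaps:
        # Gaps that are >= len(items) produce an empty range and are skipped.
        for i in range(gap, len(items)):
            value = items[i]
            j = i
            # Shift larger elements of this gap-chain to the right.
            while j >= gap and items[j - gap] > value:
                items[j] = items[j - gap]
                j -= gap
            items[j] = value
    return items


print(shellsort([5, 2, 9, 1, 5, 6, 0]))  # [0, 1, 2, 5, 5, 6, 9]
```

The final gap of 1 is a plain insertion sort, so the result is sorted whatever the other gaps are; the larger gaps only affect how much work is left for that last pass.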

The sequence shown seems to grow roughly as an exponential series, albeit with quirks. There seems to be a majority of prime numbers, but with non-primes in the mix as well. I don't see an obvious generation formula.

A valid question, assuming you must deal with arbitrarily large sets, is whether you need to emphasise worst-case performance, average-case performance, or almost-sorted performance. If the latter, you may find that a plain insertion sort using a binary search for the insertion step might be better than a shellsort. If you need good worst-case performance, then Sedgewick's sequence appears to be favoured. The sequence you mention is optimised for average-case performance, where the number of comparisons outweighs the number of moves.
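For the almost-sorted case, the binary-search insertion sort mentioned above might look like the sketch below. The function name is hypothetical, and it relies on the standard library's `bisect_right`. Note that binary search only reduces the number of comparisons; the number of element moves is unchanged.

```python
from bisect import bisect_right


def binary_insertion_sort(items):
    """Insertion sort that locates each insertion point by binary search."""
    for i in range(1, len(items)):
        value = items[i]
        # Position within the sorted prefix items[:i]; bisect_right keeps equal
        # elements in their original order (stable sort).
        pos = bisect_right(items, value, 0, i)
        # Shift items[pos:i] one slot to the right, then drop the value in.
        items[pos + 1:i + 1] = items[pos:i]
        items[pos] = value
    return items


print(binary_insertion_sort([3, 1, 2]))  # [1, 2, 3]
```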

Answer 2:

Ciura's paper generates the sequence empirically -- that is, he tried a bunch of combinations and this was the one that worked the best. Generating an optimal shellsort sequence has proven to be tricky, and the problem has so far been resistant to analysis.

The best known increment is Sedgewick's, which you can read about here (see p. 7).

Answer 3:

I would not be ashamed to take the advice given in Wikipedia's Shellsort article,

With respect to the average number of comparisons, the best known gap sequences are 1, 4, 10, 23, 57, 132, 301, 701 and similar, with gaps found experimentally. Optimal gaps beyond 701 remain unknown, but good results can be obtained by extending the above sequence according to the recursive formula h_k = \lfloor 2.25 h_{k-1} \rfloor.

Tokuda's sequence [1, 4, 9, 20, 46, 103, ...], defined by the simple formula h_k = \lceil h'_k \rceil, where h'_k = 2.25 h'_{k-1} + 1, h'_1 = 1, can be recommended for practical applications.

Guessing from the pseudonym, it seems Marcin Ciura edited the Wikipedia article himself.
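Tokuda's recurrence from the quoted passage can be sketched as follows. The function name `tokuda_gaps` is an assumption, and `Fraction` is used so that the ceiling is taken of an exact value and not a rounded float.

```python
from fractions import Fraction
from math import ceil


def tokuda_gaps(count):
    """First `count` gaps of Tokuda's sequence, smallest first."""
    gaps = []
    h = Fraction(1)                 # h'_1 = 1
    for _ in range(count):
        gaps.append(ceil(h))        # h_k = ceil(h'_k)
        h = Fraction(9, 4) * h + 1  # h'_{k+1} = 2.25 * h'_k + 1
    return gaps


print(tokuda_gaps(6))  # [1, 4, 9, 20, 46, 103]
```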

Answer 4:

The sequence is 1, 4, 10, 23, 57, 132, 301, 701, 1750. For each subsequent gap after 1750, multiply the previous gap by 2.25 and round down.
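A small sketch of that rule, assuming you want all gaps smaller than the array length `n`. The names `CIURA_BASE` and `extended_ciura_gaps` are made up for this example.

```python
CIURA_BASE = [1, 4, 10, 23, 57, 132, 301, 701, 1750]


def extended_ciura_gaps(n):
    """Gaps smaller than n, largest first: Ciura's values, then floor(2.25 * previous)."""
    gaps = list(CIURA_BASE)
    while gaps[-1] < n:
        # 9 * h // 4 equals floor(2.25 * h) without any floating-point rounding.
        gaps.append(gaps[-1] * 9 // 4)
    return [g for g in reversed(gaps) if g < n]


print(extended_ciura_gaps(10000))
# [8858, 3937, 1750, 701, 301, 132, 57, 23, 10, 4, 1]
```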

Answer 5:

I've found the following sequence, which is similar to Marcin Ciura's:

1, 4, 9, 23, 57, 138, 326, 749, 1695, 3785, 8359, 18298, 39744, etc.

For comparison, Ciura's sequence is:

1, 4, 10, 23, 57, 132, 301, 701, 1750

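My sequence is built from means of prime numbers: each term after the leading 1 is the mean of the first 2^i primes (i = 2, 3, ...), rounded down. Here is Python code to compute those means: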

import numpy as np

def isprime(n):
    ''' Check if integer n is a prime '''
    n = abs(int(n))  # n is a positive integer
    if n < 2:  # 0 and 1 are not primes
        return False
    if n == 2:  # 2 is the only even prime number
        return True
    if not n & 1:  # all other even numbers are not primes
        return False
    # Range starts with 3 and only needs to go up the square root
    # of n for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# To apply a function to a numpy array, one has to vectorize the function
vectorized_isprime = np.vectorize(isprime)

a = np.arange(10000000)
primes = a[vectorized_isprime(a)]
#print(primes)
for i in range(2,20):
    print(primes[0:2**i].mean())

The output is:

4.25
9.625
23.8125
57.84375
138.953125
326.1015625
749.04296875
1695.60742188
3785.09082031
8359.52587891
18298.4733887
39744.887085
85764.6216431
184011.130096
392925.738174
835387.635033
1769455.40302
3735498.24225

The ratio between successive terms slowly decreases from about 2.5 toward 2. Maybe this association could improve Shellsort in the future.

Answer 6:

I discussed this question here yesterday, including the gap sequences that I have found to work best for a specific (low) n.

In the middle of that discussion I wrote:

A nasty side-effect of shellsort is that when using a set of random combinations of n entries (to save processing/evaluation time) to test gaps you may end up with either the best gaps for n entries or the best gaps for your set of combinations - most likely the latter.

The problem lies in testing the proposed gaps such that valid conclusions can be drawn. Obviously, testing the gaps against all n! orderings of a set of n unique values is infeasible. Testing in this manner for n=16, for example, means that 20,922,789,888,000 different orderings of n values must be sorted to determine the exact average, worst and reverse-sorted cases - just to test one set of gaps, and that set might not be the best. 2^(16-2) sets of gaps are possible for n=16, the first being {1} and the last {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1}.
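The two counts above are easy to check. In this sketch the generator name `gap_sets` is hypothetical, and a "gap set" is taken to mean any subset of 2..n-1 with the mandatory final gap of 1 added.

```python
from itertools import combinations
from math import factorial


def gap_sets(n):
    """Yield every candidate gap set for n items: subsets of 2..n-1, each with 1 added."""
    optional = range(2, n)
    for size in range(len(optional) + 1):
        for chosen in combinations(optional, size):
            yield tuple(sorted(chosen, reverse=True)) + (1,)


print(factorial(16))                 # 20922789888000 orderings of 16 distinct values
print(sum(1 for _ in gap_sets(16)))  # 16384, i.e. 2 ** (16 - 2) gap sets
```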

To illustrate how using random combinations might give incorrect results, assume n=3, which has six different orderings: 012, 021, 102, 120, 201 and 210. You produce a set of two random sequences to test the two possible gap sets, {1} and {2,1}. Assume that these sequences turn out to be 021 and 201. For {1}, 021 can be sorted with three comparisons (02, 21 and 01) and 201 with three (20, 21 and 01), giving a total of six comparisons; divide by two and voilà, an average of 3 and a worst case of 3. Using {2,1} gives four comparisons (01, 02, 21 and 01) for 021 and three (21, 10 and 12) for 201: seven comparisons, with a worst case of 4 and an average of 3.5. The actual average and worst case for {1} are 8/3 and 3, respectively. For {2,1} the values are 10/3 and 4. The averages were too high in both cases, while the worst cases were correct. Had 012 been one of the cases, {1} would have given a 2.5 average - too low.
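The n=3 figures can be reproduced by brute force. This is a sketch with hypothetical function names; it counts comparisons in a swap-based gapped insertion sort, which is the variant the worked example appears to assume.

```python
from fractions import Fraction
from itertools import permutations


def count_comparisons(values, gaps):
    """Shellsort a copy of `values` with `gaps` and return the number of comparisons."""
    a = list(values)
    comparisons = 0
    for gap in gaps:
        for i in range(gap, len(a)):
            j = i
            while j >= gap:
                comparisons += 1
                if a[j - gap] <= a[j]:
                    break  # element is in place within its gap-chain
                a[j - gap], a[j] = a[j], a[j - gap]
                j -= gap
    return comparisons


def average_and_worst(orderings, gaps):
    """Exact average (as a Fraction) and maximum comparison count over `orderings`."""
    counts = [count_comparisons(p, gaps) for p in orderings]
    return Fraction(sum(counts), len(counts)), max(counts)


all_orderings = list(permutations(range(3)))
sample = [(0, 2, 1), (2, 0, 1)]  # the two "random" orderings from the example
for gaps in ((1,), (2, 1)):
    print(gaps, average_and_worst(all_orderings, gaps), average_and_worst(sample, gaps))
```

Over all six orderings this gives averages of 8/3 and 10/3 with worst cases of 3 and 4, while the two-ordering sample gives 3 and 7/2, matching the numbers in the paragraph above.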

Now extend this to finding a set of random sequences for n=16 such that no set of gaps tested will be favored in comparison with the others and the result close (or equal) to the true values, all the while keeping processing to a minimum. Can it be done? Possibly. After all, everything is possible - but is it probable? I think that for this problem random is the wrong approach. Selecting the sequences according to some system may be less bad and might even be good.

Source: https://stackoverflow.com/questions/2539545/fastest-gap-sequence-for-shell-sort

Tags

algorithm

performance

sorting

shellsort