In-Place Radix Sort

日久生厌 2020-12-02 03:30

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


Preliminary

15 Answers
  • 2020-12-02 04:12

    It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort", described in the paper Engineering Radix Sort. The general idea is to make two passes for each character position: first count how many of each character you have, so you can subdivide the input array into bins; then go through again, swapping each element into its correct bin. Finally, recursively sort each bin on the next character position.
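    A minimal sketch of that two-pass idea, assuming equal-length strings over "ACGT" (the function names here are mine, not from the paper):

    #include <string>
    #include <utility>
    #include <vector>

    // Map a base to a digit 0..3.
    static int baseCode(char c)
    {
        switch (c)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default : return 3;   // 'T'
        }
    }

    // One American Flag Sort pass over a[lo, hi) at character `pos`:
    // count, permute in place, then recurse into each bin.
    static void flagSort(std::vector<std::string> &a,
                         size_t lo, size_t hi, size_t pos)
    {
        if (hi - lo < 2 || pos >= a[lo].size())
            return;

        // Pass 1: histogram of the character at `pos`.
        size_t count[4] = {0, 0, 0, 0};
        for (size_t i = lo; i < hi; i++)
            count[baseCode(a[i][pos])]++;

        // Prefix sums give each bin's start; `next` is its first free slot.
        size_t start[4], next[4];
        start[0] = next[0] = lo;
        for (int d = 1; d < 4; d++)
            start[d] = next[d] = start[d - 1] + count[d - 1];

        // Pass 2: swap elements into their bins until every bin is full.
        for (int d = 0; d < 4; d++)
        {
            size_t end = start[d] + count[d];
            while (next[d] < end)
            {
                int c = baseCode(a[next[d]][pos]);
                if (c == d)
                    next[d]++;                           // already in its bin
                else
                    std::swap(a[next[d]], a[next[c]++]); // place it, pull one back
            }
        }

        // Recurse on each bin for the next character position.
        for (int d = 0; d < 4; d++)
            flagSort(a, start[d], start[d] + count[d], pos + 1);
    }

    Calling flagSort(seqs, 0, seqs.size(), 0) sorts the whole vector; beyond the strings themselves it only needs the small counter arrays and the recursion stack.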

  • 2020-12-02 04:14

    You can certainly drop the memory requirements by encoding the sequence in bits. With the alphabet "ACGT", each position has 4 possible values, so a length-2 sequence has 16 states (4 bits) and a length-3 sequence has 64 states (6 bits). In general that is 2 bits per letter in the sequence, or about 32 bits for 16 characters, like you said.

    If there is a way to reduce the number of valid 'words', further compression may be possible.
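    As a sketch of that encoding (the function name packSequence is hypothetical), here is one way to pack up to 16 letters into a 32-bit word:

    #include <cstdint>
    #include <string>

    // Pack up to 16 bases into one 32-bit word, 2 bits per base
    // (A=00, C=01, G=10, T=11).
    uint32_t packSequence(const std::string &s)
    {
        uint32_t bits = 0;
        for (char c : s)
        {
            bits <<= 2;
            switch (c)
            {
                case 'C': bits |= 1; break;
                case 'G': bits |= 2; break;
                case 'T': bits |= 3; break;
                default : break;      // 'A' contributes 00
            }
        }
        return bits;
    }

    A handy property: for equal-length sequences, the packed integers compare in the same order as the original strings, so they can be sorted as plain integers.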

    So for sequences of length 3, one could create 64 buckets, perhaps sized uint32 or uint64, and initialize them to zero. Iterate through your very large list of 3-character sequences, encode each one as above, use the encoding as a subscript, and increment that bucket.
    Repeat this until all of your sequences have been processed.

    Next, regenerate your list.

    Iterate through the 64 buckets in order; for the count found in each bucket, generate that many instances of the sequence that bucket represents.
    When all of the buckets have been visited, you have your sorted array.

    A sequence of length 4 adds 2 more bits, giving 256 buckets; length 5 adds another 2 bits, giving 1024 buckets.

    At some point the number of buckets will approach your limits. If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.

    I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.

    Here is a hack that shows the technique:

    #include <iostream>
    #include <cstring>   // strlen, memset
    #include <cstdlib>   // abort

    using namespace std;

    const int width = 3;
    const int bucketCount = 1 << (2 * width);   // 4^width buckets
          int *bucket = NULL;

    const char charMap[4] = {'A', 'C', 'G', 'T'};

    void setup(void)
    {
        bucket = new int[bucketCount];
        memset(bucket, 0, bucketCount * sizeof(bucket[0]));
    }

    void teardown(void)
    {
        delete[] bucket;
    }

    // Decode a bucket index back into its width-character sequence,
    // most significant base-4 digit first.
    void show(int encoded)
    {
        for (int z = width - 1; z >= 0; z--)
        {
            int n = 1;
            for (int y = 0; y < z; y++)
                n *= 4;

            cout << charMap[encoded / n];
            encoded %= n;
        }

        cout << endl;
    }

    int main(void)
    {
        // Sort this sequence (its length must be a multiple of width)
        const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";

        size_t testSequenceLength = strlen(testSequence);

        setup();

        // Load the sequences into the buckets: encode each group of
        // width characters as a base-4 number and count it.
        for (size_t z = 0; z < testSequenceLength; z += width)
        {
            int encoding = 0;

            for (int y = 0; y < width; y++)
            {
                encoding *= 4;

                switch (testSequence[z + y])
                {
                    case 'A' : encoding += 0; break;
                    case 'C' : encoding += 1; break;
                    case 'G' : encoding += 2; break;
                    case 'T' : encoding += 3; break;
                    default  : abort();
                }
            }

            bucket[encoding]++;
        }

        // Show the sorted sequences: visiting the buckets in index
        // order visits the sequences in lexicographic order.
        for (int z = 0; z < bucketCount; z++)
        {
            while (bucket[z] > 0)
            {
                show(z);
                bucket[z]--;
            }
        }

        teardown();

        return 0;
    }
    
  • 2020-12-02 04:16

    A plain radix sort is not cache-conscious, and it is not the fastest sort algorithm for large sets. You can look at:

    • ti7qsort, said to be the fastest sort for integers (it can be used for small fixed-size strings).
    • Inline QSORT
    • String sorting

    You can also use compression and encode each letter of your DNA into 2 bits before storing it into the sort array.

  • 2020-12-02 04:18

    First, think about the coding of your problem. Get rid of the strings and replace them with a binary representation. Use the first byte to indicate length plus encoding, or alternatively use a fixed-length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to keep exception handling out of the hot spot of the inner loop.

    OK, I thought a bit more about the base-4 problem. You want a solution like a Judy tree for this. The following solution can handle variable-length strings; for fixed length, just remove the length bits, which actually makes it easier.

    Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:

    • Variable-length strings, encoded with 7 length bits. As these blocks fill up, you replace them by:
    • Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
    • Bitmap encoding of the last three characters of a string.

    For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.

    This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.

    For even more performance, you might want to add different block types and larger block sizes. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off.

    You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.

    Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
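    The block-compressed, Judy-style structure above takes real engineering to build; as a much simpler sketch of the same shape, here is a naive 4-way trie with duplicate counts (all names are hypothetical, and none of the pointer-tagging or block-compression tricks are shown):

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    // Naive 4-way trie holding DNA strings with duplicate counts.
    struct Node
    {
        std::unique_ptr<Node> child[4];
        size_t count = 0;   // number of strings ending at this node
    };

    static int baseCode(char c)
    {
        switch (c)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default : return 3;   // 'T'
        }
    }

    void insert(Node &root, const std::string &s)
    {
        Node *n = &root;
        for (char c : s)
        {
            int d = baseCode(c);
            if (!n->child[d])
                n->child[d] = std::make_unique<Node>();
            n = n->child[d].get();
        }
        n->count++;   // duplicates just bump the count
    }

    // Pre-order walk emits the stored strings in sorted order
    // (shorter strings sort before their extensions).
    void emit(const Node &n, std::string &prefix, std::vector<std::string> &out)
    {
        static const char bases[4] = {'A', 'C', 'G', 'T'};
        for (size_t i = 0; i < n.count; i++)
            out.push_back(prefix);
        for (int d = 0; d < 4; d++)
        {
            if (n.child[d])
            {
                prefix.push_back(bases[d]);
                emit(*n.child[d], prefix, out);
                prefix.pop_back();
            }
        }
    }

    Inserting all sequences and then walking the trie yields them in sorted order, duplicates included; the block types described above can be seen as compressed encodings of runs of such nodes.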

  • 2020-12-02 04:19

    If your data set is that big, then I would think a disk-based buffer approach would be best:

    sort(List<string> elements, int prefix)
        if (elements.Count < THRESHOLD)
             return InMemoryRadixSort(elements, prefix)
        else
             return DiskBackedRadixSort(elements, prefix)

    DiskBackedRadixSort(elements, prefix)
        DiskBackedBuffer<string>[] buckets = new DiskBackedBuffer<string>[BUCKET_COUNT]
        foreach (element in elements)
            buckets[element.MSB(prefix)].Add(element)

        List<string> ret = new List<string>()
        foreach (bucket in buckets)
            ret.AddRange(sort(bucket, prefix + 1))

        return ret
    

    I would also experiment with grouping into a larger number of buckets; for instance, if your string was:

    GATTACA
    

    the first MSB call would return the bucket for GATT (256 buckets in total); that way you create fewer levels of disk-based buffers. This may or may not improve performance, so experiment with it.
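    As an illustration, here is one way such a 256-way bucket index could be computed for the group of four characters selected by prefix (the name bucketIndex and the 'A'-padding rule for short strings are my assumptions, not part of the answer):

    #include <cstddef>
    #include <string>

    // Bucket index for 4-character group number `prefix` of a DNA
    // string, 2 bits per base, giving 256 buckets.
    int bucketIndex(const std::string &s, int prefix)
    {
        int bits = 0;
        for (int i = 0; i < 4; i++)
        {
            std::size_t pos = static_cast<std::size_t>(prefix) * 4 + i;
            int d = 0;   // 'A', also used as padding past the end
            if (pos < s.size())
            {
                switch (s[pos])
                {
                    case 'C': d = 1; break;
                    case 'G': d = 2; break;
                    case 'T': d = 3; break;
                }
            }
            bits = (bits << 2) | d;
        }
        return bits;   // 0..255, e.g. "GATTACA" with prefix 0 -> 143
    }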

  • 2020-12-02 04:19

    "Radix sorting with no extra space" is a paper addressing your problem.
