In-Place Radix Sort

日久生厌 2020-12-02 03:30

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


Preliminary

15 Answers
  • 2020-12-02 04:19

    You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.

    Strings composed of the four nucleotide letters A, C, G, and T can be encoded as integers for much faster processing. Radix sort is among the many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.
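
    As a rough illustration of the encoding idea (not the book's actual scheme, which isn't reproduced here), below is a minimal Python sketch. It assumes a 2-bit code per base with A < C < G < T, so that comparing the integers of equal-length sequences gives the same order as comparing the strings. The names CODES, encode, decode, and sort_fixed_length are made up for this example.

```python
# Hypothetical 2-bit codes; the order A < C < G < T matches alphabetical order.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def encode(seq):
    """Pack a nucleotide string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]  # KeyError on any letter outside ACGT
    return value


def decode(value, length):
    """Unpack an integer produced by encode() into a string of `length` bases."""
    bases = []
    for _ in range(length):
        bases.append(LETTERS[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


def sort_fixed_length(seqs):
    """Sort equal-length sequences by comparing their integer codes."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [decode(v, length) for v in sorted(encode(s) for s in seqs)]
```

    In a compiled language the same packing would fit up to 32 bases into a 64-bit word, which is where most of the speedup would be expected to come from; the Python version only shows the mapping.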

  • 2020-12-02 04:21

    I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:

    1. You control the reading of the data
    2. You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.

    The beauty of the heap/heapsort approach is that you can build the heap while you read the data, and you can start getting results the moment the heap is built.

    Let's step back. If you are fortunate enough to be able to read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), then you can build a chunk of the heap while you wait for the next chunk of data to come in, even from disk. Often, this approach can hide most of the cost of the first half of the sort (building the heap) behind the time spent getting the data.

    Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunk after chunk as you go.
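
    A minimal Python sketch of this idea, assuming the data arrives as an iterable of chunks. The function name sorted_stream and its chunks parameter are hypothetical, and the standard-library heapq module stands in for a hand-written heap. The plain loop over chunks only stands in for asynchronous reads; no real I/O overlap is shown.

```python
import heapq


def sorted_stream(chunks):
    """Yield items in ascending order; `chunks` is any iterable of lists."""
    heap = []
    for chunk in chunks:  # stands in for data arriving piece by piece
        for item in chunk:
            heapq.heappush(heap, item)  # O(log n) per insertion
    # The heap is complete: the smallest item is available immediately,
    # and each further item costs one O(log n) pop.
    while heap:
        yield heapq.heappop(heap)


if __name__ == "__main__":
    stream = sorted_stream([["GATT", "ACGT"], ["TTAG", "CCGA"]])
    print(next(stream))  # ACGT, available before the rest has been extracted
```

    Because the result is a generator, a consumer can start processing the first items while the remaining ones are still sitting in the heap.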

    That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(

    See the Wikipedia articles:

    • Heapsort
    • Binary heap
  • 2020-12-02 04:23

    Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and am therefore least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place, but it makes about 2 * L passes through the array in total, where L is the length of each sequence (two partitioning passes per base position).

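    import std.algorithm : swap;

    // In-place MSD radix sort for DNA strings over the alphabet {A, C, G, T}.
    // Assumes all sequences are non-empty and have the same length.
    void radixSort(string[] seqs, size_t base = 0) {
        if(seqs.length == 0)
            return;

        // Pass 1: move A's to the front and T's to the back;
        // C's and G's end up in the middle, in [APos, TPos).
        size_t TPos = seqs.length, APos = 0;
        size_t i = 0;
        while(i < TPos) {
            if(seqs[i][base] == 'A') {
                swap(seqs[i], seqs[APos++]);
                i++;
            }
            else if(seqs[i][base] == 'T') {
                // Don't advance i: the element swapped in is still unexamined.
                swap(seqs[i], seqs[--TPos]);
            } else i++;
        }

        // Pass 2: within the middle region, move C's ahead of G's.
        i = APos;
        size_t CPos = APos;
        while(i < TPos) {
            if(seqs[i][base] == 'C') {
                swap(seqs[i], seqs[CPos++]);
            }
            i++;
        }

        // Recurse into the A, C, G, and T buckets on the next base.
        if(base < seqs[0].length - 1) {
            radixSort(seqs[0..APos], base + 1);
            radixSort(seqs[APos..CPos], base + 1);
            radixSort(seqs[CPos..TPos], base + 1);
            radixSort(seqs[TPos..seqs.length], base + 1);
        }
    }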
    

    Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.

    Edit:

    I got curious whether this code actually works, so I tested and debugged it while waiting for my own bioinformatics code to run. The version above is now tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
