In-Place Radix Sort

故事扮演 提交于 2019-11-27 09:57:24
dsimcha

Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
             swap(seqs[i], seqs[APos++]);
             i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
   }
}

Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.

Edit:

I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.

I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.

Reason:

The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.

I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).

That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.

I have no idea if it works out in practice though.

Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.

EvilTeach

You can certainly drop the memory requirements by encoding the sequence in bits. You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits. For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.

If there is a way to reduce the number of valid 'words', further compression may be possible.

So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64. Initialize them to zero. Iterate through your very very large list of 3 char sequences, and encode them as above. Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.

Next, regenerate your list.

Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.

A sequence of 4, adds 2 bits, so there would be 256 buckets. A sequence of 5, adds 2 bits, so there would be 1024 buckets.

At some point the number of buckets will approach your limits. If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.

I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.

Here is a hack that shows the technique

#include <iostream>
#include <iomanip>

#include <math.h>

using namespace std;

const int width = 3;
const int bucketCount = exp(width * log(4)) + 1;
      int *bucket = NULL;

const char charMap[4] = {'A', 'C', 'G', 'T'};

void setup
(
    void
)
{
    bucket = new int[bucketCount];
    memset(bucket, '\0', bucketCount * sizeof(bucket[0]));
}

void teardown
(
    void
)
{
    delete[] bucket;
}

void show
(
    int encoded
)
{
    int z;
    int y;
    int j;
    for (z = width - 1; z >= 0; z--)
    {
        int n = 1;
        for (y = 0; y < z; y++)
            n *= 4;

        j = encoded % n;
        encoded -= j;
        encoded /= n;
        cout << charMap[encoded];
        encoded = j;
    }

    cout << endl;
}

int main(void)
{
    // Sort this sequence
    const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";

    size_t testSequenceLength = strlen(testSequence);

    setup();


    // load the sequences into the buckets
    size_t z;
    for (z = 0; z < testSequenceLength; z += width)
    {
        int encoding = 0;

        size_t y;
        for (y = 0; y < width; y++)
        {
            encoding *= 4;

            switch (*(testSequence + z + y))
            {
                case 'A' : encoding += 0; break;
                case 'C' : encoding += 1; break;
                case 'G' : encoding += 2; break;
                case 'T' : encoding += 3; break;
                default  : abort();
            };
        }

        bucket[encoding]++;
    }

    /* show the sorted sequences */ 
    for (z = 0; z < bucketCount; z++)
    {
        while (bucket[z] > 0)
        {
            show(z);
            bucket[z]--;
        }
    }

    teardown();

    return 0;
}

If your data set is so big, then I would think that a disk-based buffer approach would be best:

sort(List<string> elements, int prefix)
    if (elements.Count < THRESHOLD)
         return InMemoryRadixSort(elements, prefix)
    else
         return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(elements, prefix)
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element);

    List<string> ret
    foreach (bucket in buckets)
        ret.Add(sort(bucket, prefix + 1))

    return ret

I would also experiment grouping into a larger number of buckets, for instance, if your string was:

GATTACA

the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.

Joe

I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:

  1. You control the reading of the data
  2. You can do something meaningful with the sorted data as soon as you 'start' getting it sorted.

The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.

Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.

Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.

That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(

See the Wikipedia articles:

Performance-wise you might want to look at a more general string-comparison sorting algorithms.

Currently you wind up touching every element of every string, but you can do better!

In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.

A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.

For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.

You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.

Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.

eig

"Radix sorting with no extra space" is a paper addressing your problem.

You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).

Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.

I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.

Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets. You can look at:

You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.

dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.

I suggest a very simple approach:

  1. Empirically estimate the largest size m for which a radix sort is efficient.
  2. Read blocks of m elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.
  3. Mergesort the resulting sorted blocks.

Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require 2n space to sort n items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.

Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.

It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.

Stephan Eggermont

First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.

OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.

Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:

  • Encoding with 7 length bits of variable-length strings. As they fill up, you replace them by:
  • Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
  • Bitmap encoding of the last three characters of a string.

For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.

This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.

For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off

You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.

Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!