In-Place Radix Sort

日久生厌 2020-12-02 03:30

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?


Preliminary

15 Answers
  • 2020-12-02 04:12

    It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort", described in the paper Engineering Radix Sort. The general idea is to make two passes for each character position: first count how many of each character you have, so you can subdivide the input array into bins; then go through again, swapping each element into its correct bin. Finally, recursively sort each bin on the next character position.
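    A minimal sketch of that two-pass idea, assuming equal-length strings over "ACGT" (the function names here are mine, not from the paper):

    #include <string>
    #include <utility>
    #include <vector>

    // Map a base to a digit 0..3.
    static int baseCode(char c)
    {
        switch (c)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default : return 3;   // 'T'
        }
    }

    // One American Flag Sort pass over a[lo, hi) at character `pos`:
    // count, permute in place, then recurse into each bin.
    static void flagSort(std::vector<std::string> &a,
                         size_t lo, size_t hi, size_t pos)
    {
        if (hi - lo < 2 || pos >= a[lo].size())
            return;

        // Pass 1: histogram of the character at `pos`.
        size_t count[4] = {0, 0, 0, 0};
        for (size_t i = lo; i < hi; i++)
            count[baseCode(a[i][pos])]++;

        // Prefix sums give each bin's start; `next` is its first free slot.
        size_t start[4], next[4];
        start[0] = next[0] = lo;
        for (int d = 1; d < 4; d++)
            start[d] = next[d] = start[d - 1] + count[d - 1];

        // Pass 2: swap elements into their bins until every bin is full.
        for (int d = 0; d < 4; d++)
        {
            size_t end = start[d] + count[d];
            while (next[d] < end)
            {
                int c = baseCode(a[next[d]][pos]);
                if (c == d)
                    next[d]++;                           // already in its bin
                else
                    std::swap(a[next[d]], a[next[c]++]); // place it, pull one back
            }
        }

        // Recurse on each bin for the next character position.
        for (int d = 0; d < 4; d++)
            flagSort(a, start[d], start[d] + count[d], pos + 1);
    }

    Calling flagSort(seqs, 0, seqs.size(), 0) sorts the whole vector; beyond the strings themselves it only needs the small counter arrays and the recursion stack.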

  • 2020-12-02 04:14

    You can certainly drop the memory requirements by encoding the sequence in bits. With the alphabet "ACGT", each position has 4 possible values, so a length-2 sequence has 16 states (4 bits) and a length-3 sequence has 64 states (6 bits). In general that is 2 bits per letter in the sequence, or about 32 bits for 16 characters, like you said.

    If there is a way to reduce the number of valid 'words', further compression may be possible.
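    As a sketch of that encoding (the function name packSequence is hypothetical), here is one way to pack up to 16 letters into a 32-bit word:

    #include <cstdint>
    #include <string>

    // Pack up to 16 bases into one 32-bit word, 2 bits per base
    // (A=00, C=01, G=10, T=11).
    uint32_t packSequence(const std::string &s)
    {
        uint32_t bits = 0;
        for (char c : s)
        {
            bits <<= 2;
            switch (c)
            {
                case 'C': bits |= 1; break;
                case 'G': bits |= 2; break;
                case 'T': bits |= 3; break;
                default : break;      // 'A' contributes 00
            }
        }
        return bits;
    }

    A handy property: for equal-length sequences, the packed integers compare in the same order as the original strings, so they can be sorted as plain integers.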

    So for sequences of length 3, one could create 64 buckets, perhaps sized uint32 or uint64, and initialize them to zero. Iterate through your very large list of 3-character sequences, encode each one as above, use the encoding as a subscript, and increment that bucket.
    Repeat this until all of your sequences have been processed.

    Next, regenerate your list.

    Iterate through the 64 buckets in order; for the count found in each bucket, generate that many instances of the sequence that bucket represents.
    When all of the buckets have been visited, you have your sorted array.

    A sequence of length 4 adds 2 more bits, giving 256 buckets; length 5 adds another 2 bits, giving 1024 buckets.

    At some point the number of buckets will approach your limits. If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.

    I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.

    Here is a hack that shows the technique:

    #include <iostream>
    #include <cstring>   // strlen, memset
    #include <cstdlib>   // abort

    using namespace std;

    const int width = 3;
    const int bucketCount = 1 << (2 * width);   // 4^width buckets
          int *bucket = NULL;

    const char charMap[4] = {'A', 'C', 'G', 'T'};

    void setup(void)
    {
        bucket = new int[bucketCount];
        memset(bucket, 0, bucketCount * sizeof(bucket[0]));
    }

    void teardown(void)
    {
        delete[] bucket;
    }

    // Decode a bucket index back into its width-character sequence,
    // most significant base-4 digit first.
    void show(int encoded)
    {
        for (int z = width - 1; z >= 0; z--)
        {
            int n = 1;
            for (int y = 0; y < z; y++)
                n *= 4;

            cout << charMap[encoded / n];
            encoded %= n;
        }

        cout << endl;
    }

    int main(void)
    {
        // Sort this sequence (its length must be a multiple of width)
        const char *testSequence = "CAGCCCAAAGGGTTTAGACTTGGTGCGCAGCAGTTAAGATTGTTT";

        size_t testSequenceLength = strlen(testSequence);

        setup();

        // Load the sequences into the buckets: encode each group of
        // width characters as a base-4 number and count it.
        for (size_t z = 0; z < testSequenceLength; z += width)
        {
            int encoding = 0;

            for (int y = 0; y < width; y++)
            {
                encoding *= 4;

                switch (testSequence[z + y])
                {
                    case 'A' : encoding += 0; break;
                    case 'C' : encoding += 1; break;
                    case 'G' : encoding += 2; break;
                    case 'T' : encoding += 3; break;
                    default  : abort();
                }
            }

            bucket[encoding]++;
        }

        // Show the sorted sequences: visiting the buckets in index
        // order visits the sequences in lexicographic order.
        for (int z = 0; z < bucketCount; z++)
        {
            while (bucket[z] > 0)
            {
                show(z);
                bucket[z]--;
            }
        }

        teardown();

        return 0;
    }
    
  • 2020-12-02 04:16

    A plain radix sort is not cache-conscious, and it is not the fastest sort algorithm for large sets. You can look at:

    • ti7qsort, said to be the fastest sort for integers (it can be used for small fixed-size strings).
    • Inline QSORT
    • String sorting

    You can also use compression and encode each letter of your DNA into 2 bits before storing it into the sort array.

  • 2020-12-02 04:18

    First, think about the coding of your problem. Get rid of the strings and replace them with a binary representation. Use the first byte to indicate length plus encoding, or alternatively use a fixed-length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to keep exception handling out of the hot spot of the inner loop.

    OK, I thought a bit more about the base-4 problem. You want a solution like a Judy tree for this. The following solution can handle variable-length strings; for fixed length, just remove the length bits, which actually makes it easier.

    Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:

    • Variable-length strings, encoded with 7 length bits. As these blocks fill up, you replace them by:
    • Position encodes the next two characters, you have 16 pointers to the next blocks, ending with:
    • Bitmap encoding of the last three characters of a string.

    For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.

    This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.

    For even more performance, you might want to add different block types and larger block sizes. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off.

    You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.

    Are you interested in handling duplicates? With short sequences, there are going to be some. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
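    The block-compressed, Judy-style structure above takes real engineering to build; as a much simpler sketch of the same shape, here is a naive 4-way trie with duplicate counts (all names are hypothetical, and none of the pointer-tagging or block-compression tricks are shown):

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    // Naive 4-way trie holding DNA strings with duplicate counts.
    struct Node
    {
        std::unique_ptr<Node> child[4];
        size_t count = 0;   // number of strings ending at this node
    };

    static int baseCode(char c)
    {
        switch (c)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default : return 3;   // 'T'
        }
    }

    void insert(Node &root, const std::string &s)
    {
        Node *n = &root;
        for (char c : s)
        {
            int d = baseCode(c);
            if (!n->child[d])
                n->child[d] = std::make_unique<Node>();
            n = n->child[d].get();
        }
        n->count++;   // duplicates just bump the count
    }

    // Pre-order walk emits the stored strings in sorted order
    // (shorter strings sort before their extensions).
    void emit(const Node &n, std::string &prefix, std::vector<std::string> &out)
    {
        static const char bases[4] = {'A', 'C', 'G', 'T'};
        for (size_t i = 0; i < n.count; i++)
            out.push_back(prefix);
        for (int d = 0; d < 4; d++)
        {
            if (n.child[d])
            {
                prefix.push_back(bases[d]);
                emit(*n.child[d], prefix, out);
                prefix.pop_back();
            }
        }
    }

    Inserting all sequences and then walking the trie yields them in sorted order, duplicates included; the block types described above can be seen as compressed encodings of runs of such nodes.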

  • 2020-12-02 04:19

    If your data set is that big, then I would think a disk-based buffer approach would be best:

    sort(List<string> elements, int prefix)
        if (elements.Count < THRESHOLD)
             return InMemoryRadixSort(elements, prefix)
        else
             return DiskBackedRadixSort(elements, prefix)

    DiskBackedRadixSort(elements, prefix)
        DiskBackedBuffer<string>[] buckets = new DiskBackedBuffer<string>[BUCKET_COUNT]
        foreach (element in elements)
            buckets[element.MSB(prefix)].Add(element)

        List<string> ret = new List<string>()
        foreach (bucket in buckets)
            ret.AddRange(sort(bucket, prefix + 1))

        return ret
    

    I would also experiment with grouping into a larger number of buckets; for instance, if your string was:

    GATTACA
    

    the first MSB call would return the bucket for GATT (256 buckets in total); that way you create fewer levels of disk-based buffers. This may or may not improve performance, so experiment with it.
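    As an illustration, here is one way such a 256-way bucket index could be computed for the group of four characters selected by prefix (the name bucketIndex and the 'A'-padding rule for short strings are my assumptions, not part of the answer):

    #include <cstddef>
    #include <string>

    // Bucket index for 4-character group number `prefix` of a DNA
    // string, 2 bits per base, giving 256 buckets.
    int bucketIndex(const std::string &s, int prefix)
    {
        int bits = 0;
        for (int i = 0; i < 4; i++)
        {
            std::size_t pos = static_cast<std::size_t>(prefix) * 4 + i;
            int d = 0;   // 'A', also used as padding past the end
            if (pos < s.size())
            {
                switch (s[pos])
                {
                    case 'C': d = 1; break;
                    case 'G': d = 2; break;
                    case 'T': d = 3; break;
                }
            }
            bits = (bits << 2) | d;
        }
        return bits;   // 0..255, e.g. "GATTACA" with prefix 0 -> 143
    }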

  • 2020-12-02 04:19

    "Radix sorting with no extra space" is a paper addressing your problem.
