to SORT 1 BILLION of integer numbers with small physical memory

荒凉一梦 提交于 2019-12-03 21:07:47

The simplest thing to do is break the input into smaller files that can fit in memory and sort each, and then merge the results.

Guido van Rossum has a good description of doing this in python while obviously not the same language the principle is the same.

Integers are 7 digit numbers.

So there are only 10 million possible values.

You have 1GB of RAM. Make an array of counters, one for each possible value.

Read through the file once, count up the counters.

When done, output the numbers according to the final counter values.

Every number can occur at most 1 billion times. So a 32bit counter would be enough. This means a 10M x 4 bytes = 40M byte array.

You specified that are sorting a billion 7 (decimal) digit numbers.

If there were no duplicates, you could sort in memory with 107 BITS using radix sort. Since you must have duplicates (107 less than 109), you could implement radix sort using (say) an array of 107 8-bit counters, with a HashMap<Integer, Integer> to deal with the relatively few cases where the counters overflow. Or just an array of 107 32-bit counters.

Another more general approach (that works for any kind of value) is to split the file into N smaller subfiles, sort each subfile in memory, and then perform an N-way merge of the sorted subfiles.

Using a BitSet with 4 billion possible values occupies 512 MB. Just set all the int values you see and write them out in order (they are naturally sorted)

This only works if you don't care about duplicates.

If counting duplicates matters I would still consider either a memory mapped file for counting, or using a merge sort of sorted subsections of data. (I believe the later is an expected answer)

I recently bough a 24 GB PC for under £1K, so a few GB isn't that much unless you limited by a hosted solution. (Or using a mobile device)

Assuming every integer occurs exactly one time you can read the file and for every number you find you set a bit - the bit array has to hold 10000000 bits - this uses only 1,28 MB RAM which should be available... after you have read all integers you just go through the array and output the numbers where a bit ist set...

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!