Interview puzzle: Sorting a million-number input with limited memory

野性不改 2020-12-29 01:02

I tried answering this using external sort, but the interviewer replied that the complexity was too high: n·n·log(n), i.e. n²·log n. Is there a better alternative?


3 Answers
  •  渐次进展
    2020-12-29 01:22

    The standard way of doing it is an External Sort.

    In an external sort it is not only important to have O(n log n) complexity - it is also critical to minimize disk reads/writes as much as possible, and to make most reads and writes sequential (not random), since disk access is much more efficient when done sequentially.
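The overall scheme can be sketched in a few lines of Python. This is a minimal, illustrative version under simplifying assumptions: integers one per line, a plain chunk-sort for run generation (rather than the replacement selection discussed below), and `heapq.merge` standing in for a buffered k-way merge:

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_items):
    """Sort a file of one integer per line, holding at most ~max_items in memory."""
    run_paths = []
    # Phase 1: read chunks that fit in memory, sort each, write sorted "runs" to disk.
    with open(input_path) as f:
        while True:
            chunk = [int(line) for line in itertools.islice(f, max_items)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)
    # Phase 2: k-way merge of all runs; heapq.merge keeps only one
    # element per run in memory at a time.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            merged = heapq.merge(*((int(line) for line in r) for r in runs))
            out.writelines(f"{x}\n" for x in merged)
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)
```

Note this does a single merge over all runs; as discussed below, a real implementation caps the merge order k by the available buffer memory and may need several merge passes.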

    The standard way of doing so is indeed a k-way merge sort, as suggested by @JanDvorak, but there are some faults in and additions to the suggestion that I am aiming to correct:

    1. First, doing RS (Replacement-Selection) on the input decreases the number of initial "runs" (increasing sequences) and thus usually decreases the total number of iterations needed by the later merge sort.
    2. We need memory for buffering (reading and writing input) - thus, for memory size M and file size M*10, we cannot do a 10-way merge - it would result in a LOT of disk reads (reading each element individually, rather than in blocks).
      The standard formula for k, the "order" of the merge, is k = M/(2b), where M is the size of your memory and b is the size of each "buffer" (usually one disk block).
    3. Each merge sort step is done by reading b entries from each "run" generated in the previous iteration - filling M/2 of the memory. The rest of the memory is used for "prediction" (which allows continuous work with minimal waiting on IO) - requesting more elements from a run - and for the output buffer, in order to guarantee sequential writes in blocks.
    4. The total number of iterations with this approach is log_k(N/(2M)), where k is the merge order (calculated above), M is the size of the memory, and N is the size of the file; N/(2M) is the expected number of initial runs after replacement selection. Each iteration requires 1 sequential read and 1 sequential write of the entire file.
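To make step 1 concrete, here is a sketch of replacement selection in Python. It is illustrative only: it yields runs as in-memory lists, whereas a real implementation streams each run to disk as it is produced. Elements that are smaller than the last output cannot join the current run and are deferred to the next one, which is why runs average ~2M in length on random input:

```python
import heapq
import itertools

def replacement_selection(items, memory_size):
    """Yield sorted runs from an input stream, holding at most memory_size items."""
    it = iter(items)
    heap = list(itertools.islice(it, memory_size))
    heapq.heapify(heap)
    run, deferred = [], []
    for x in it:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        if x >= smallest:
            heapq.heappush(heap, x)   # still fits in the current run
        else:
            deferred.append(x)        # too small: wait for the next run
        if not heap:                  # everything left belongs to the next run
            yield run
            run = []
            heap = deferred
            heapq.heapify(heap)
            deferred = []
    # Input exhausted: flush the heap (all >= last output) then the deferred items.
    run.extend(sorted(heap))
    if run:
        yield run
    if deferred:
        yield sorted(deferred)
```

The invariant is that the heap never contains an element smaller than the last value appended to the current run, so each yielded run is nondecreasing.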

    That said - the ratio of file_size/memory_size is usually MUCH greater than 10. If you are interested only in a ratio of 10, local optimizations might apply, but they do not help in the more common case where file_size/memory_size >> 10.
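Plugging in some round numbers (the sizes here are my own assumptions, not from the question) shows why a single merge pass usually suffices even at a large file/memory ratio:

```python
import math

M = 1 << 30          # assumed memory: 1 GiB
b = 4096             # assumed disk block: 4 KiB
N = 100 << 30        # assumed file size: 100 GiB

k = M // (2 * b)                        # merge order: 131072-way merge
runs = N / (2 * M)                      # ~50 initial runs after replacement selection
passes = math.ceil(math.log(runs, k))   # log_k(50) < 1, so 1 merge pass
```

Because k (here 131072) vastly exceeds the number of initial runs (here ~50), everything merges in one pass: one sequential read and one sequential write of the file after run generation.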
