What algorithm to use to delete duplicates?

Submitted by 烈酒焚心 on 2019-12-10 15:58:55

Question


Imagine that we have a file called, for example, "A.txt". We know that it contains some duplicate elements. "A.txt" is very big, more than ten times bigger than memory, maybe around 50 GB. Sometimes the size of the deduplicated output B will be approximately equal to the size of A, and sometimes it will be many times smaller. Let A have a structure like this:

a 1
b 2
c 445
a 1

We need to produce a file "B.txt" that does not contain such duplicates. For this example, it should be:

a 1
b 2
c 445

I thought about an algorithm that copies A to B, then takes the first string in B and compares it with every other string, deleting any duplicates it finds. Then it takes the second string, and so on.

But I think that is way too slow. What can I use instead?

A is not a database! No SQL, please.

Sorry, I didn't say it earlier: sorting is OK.

Although it can be sorted in this case, what if it could not be?


Answer 1:


Suppose you can fit 1/k'th of the file into memory and still have room for working data structures. The whole file can be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting the whole file depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 5G lines, ln(n) ~ 22.3. In addition, if your output file B is much smaller than the input file A, the process probably will take only one or two passes.

Process:

  1. Allocate a few megabytes for input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.
  2. Repeat: Fill I from F and process it into R, via step 3.
  3. For each line L in I, check if L is already in H and R. If so, go on to next L, else add L to R and its hash to H.
  4. When R is full, say with M entries, write it to O. Then repeatedly fill I from F, dedup as in step 3, and write to O. At EOF(F) go to 5.
  5. Repeat (using old O as input F and a new O for output): Read M lines from F and copy to O. Then load R and H as in steps 2 and 3, and copy to EOF(F) with dedup as before. Set M to the new number of non-dupped lines at the beginning of each O file.

Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines are duplicated in the rest of O. Thus, at least 1/k'th of the original file is processed per pass, so processing takes at most k passes.
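For illustration, here is a minimal Python sketch of this multi-pass process (before the Update 1 refinement below). All names (one_pass, dedup_file, MAX_KEPT, the .tmp scratch file) are my own, and for simplicity a single set of whole lines stands in for both the hash table H and the result buffer R, which works as long as MAX_KEPT lines fit in memory:

    import os

    MAX_KEPT = 1_000_000  # tune so this many lines (R plus H) fit comfortably in memory

    def one_pass(src_path, dst_path, skip):
        """Copy the first `skip` already-finalized lines unchanged, finalize up to
        MAX_KEPT new unique lines, and filter the rest of the file against them.
        Returns (finalized, written) line counts for the file just written."""
        kept = set()                        # stands in for both H and R
        finalized, written = skip, 0
        with open(src_path, "r", encoding="utf-8") as src, \
             open(dst_path, "w", encoding="utf-8") as dst:
            for i, line in enumerate(src):
                if i < skip:                # already known to be unique: copy through
                    dst.write(line)
                    written += 1
                elif line in kept:          # duplicate of a line finalized this pass: drop
                    continue
                elif len(kept) < MAX_KEPT:  # room left: finalize this line
                    kept.add(line)
                    finalized += 1
                    dst.write(line)
                    written += 1
                else:                       # buffer full: pass through; remaining
                    dst.write(line)         # duplicates wait for the next pass
                    written += 1
        return finalized, written

    def dedup_file(a_path, b_path):
        """Repeat passes, feeding each pass's output into the next, until every
        line of the output has been finalized."""
        spare = b_path + ".tmp"             # scratch file so A itself is never overwritten
        src, dst, finalized = a_path, b_path, 0
        while True:
            finalized, written = one_pass(src, dst, finalized)
            if finalized >= written:        # no unfinalized tail left: done
                break
            src, dst = dst, (spare if src == a_path else src)
        if dst != b_path:
            os.replace(dst, b_path)
        if os.path.exists(spare):
            os.remove(spare)

Each call to one_pass finalizes up to MAX_KEPT more lines, so a file whose unique lines amount to k times MAX_KEPT takes about k passes, matching the analysis above.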

Update 1: Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which process buffer R is appended at the end of each pass. This cuts the amount of reading and writing by a factor of k/2 when result file B is nearly as large as A, or by somewhat less when B is much smaller than A; but in no case does it increase the amount of I/O.
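A quick sketch of that refinement, under the same assumptions and invented names as above (the appended file here plays the role of P, with B accumulating the finalized lines directly):

    import os

    MAX_KEPT = 1_000_000  # as in the previous sketch

    def dedup_with_side_file(a_path, b_path):
        """Update-1 variant: finalized lines are appended to B exactly once and
        never re-read; only the not-yet-finalized tail is rewritten each pass."""
        tails = (b_path + ".tail", b_path + ".tail2")   # scratch names of my choosing
        src, spare = a_path, tails[0]
        open(b_path, "w").close()                       # start with an empty B
        while True:
            kept = set()
            tail_lines = 0
            with open(src, "r", encoding="utf-8") as f, \
                 open(b_path, "a", encoding="utf-8") as out, \
                 open(spare, "w", encoding="utf-8") as tail:
                for line in f:
                    if line in kept:                    # duplicate within this batch: drop
                        continue
                    if len(kept) < MAX_KEPT:
                        kept.add(line)
                        out.write(line)                 # finalized: straight to B
                    else:
                        tail.write(line)                # defer to the next pass
                        tail_lines += 1
            if tail_lines == 0:
                break
            src, spare = spare, (tails[1] if src == a_path else src)
        for path in tails:
            if os.path.exists(path):
                os.remove(path)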




Answer 2:


One solution would be to sort the file, then copy one line at a time to a new file, filtering out consecutive duplicates.
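For instance, once a sorted copy of A exists (call it "A.sorted.txt"; the file name and function below are just illustrative), the duplicates are adjacent and one streaming pass with O(1) memory removes them:

    def drop_consecutive_duplicates(sorted_path, out_path):
        """Copy sorted_path to out_path, skipping lines equal to the previous one."""
        prev = None
        with open(sorted_path, "r", encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                if line != prev:
                    dst.write(line)
                    prev = line

    drop_consecutive_duplicates("A.sorted.txt", "B.txt")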

Then the question becomes: how do you sort a file that is too big to fit in memory?

Here's how Unix sort does it.

See also this question.
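Roughly, the standard answer is an external merge sort, which is also what Unix sort does: sort chunks that fit in memory, spill each sorted chunk to a temporary file, then do a k-way merge of the chunks. Here is a rough Python sketch (the chunk size and names are my own, and it assumes the input ends with a newline):

    import heapq
    import os
    import tempfile

    def external_sort(src_path, dst_path, chunk_bytes=64 * 1024 * 1024):
        """Sort src_path into dst_path using temporary sorted chunk files."""
        chunk_paths = []
        with open(src_path, "r", encoding="utf-8") as src:
            while True:
                chunk = src.readlines(chunk_bytes)   # read roughly chunk_bytes of whole lines
                if not chunk:
                    break
                chunk.sort()                         # in-memory sort of this chunk
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # k-way merge of the sorted chunks into the final output
        chunks = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*chunks))
        finally:
            for f in chunks:
                f.close()
            for p in chunk_paths:
                os.remove(p)

For a 50 GB file and 64 MB chunks this opens on the order of 800 chunk files at once; a real implementation (or Unix sort itself) would merge in several levels or use larger chunks, but the idea is the same.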




Answer 3:


You will essentially have to build up a searchable result set (if the language reminds you of database technology, this is no accident, no matter how much you hate the fact that databases deal with the same questions as you do).

Possible efficient data structures for that include a sorted range (implementable as a tree of some sort) or a hash table. As you process your file, you insert each record into the result set efficiently, and at that point you check whether the record already exists. When you're done, you will have a reduced set of unique records.

Rather than duplicating the actual record, your result set could also store a reference of some sort to any one of the original records. It depends on whether the records are large enough to make that a more efficient solution.
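As a rough illustration of the reference idea (the function name, the SHA-1 choice, and the assumption that all digests fit in memory are mine, not the answer's), the result set could hold only a digest of each record plus the byte offset of its first occurrence, rather than the record itself:

    import hashlib

    def unique_record_offsets(path):
        """Map a digest of each distinct line to the byte offset of its first occurrence."""
        seen = {}
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:                   # end of file
                    break
                digest = hashlib.sha1(line).digest()
                if digest not in seen:         # first time we see this record
                    seen[digest] = offset
        return seen

The offsets can then be used to copy exactly one occurrence of each record into B, which pays off when records are much longer than a 20-byte digest.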

Or you could simply add a mark to the original data indicating whether or not each record is to be included.

(Also consider using an efficient storage format like NetCDF for binary data, as a textual representation is far far slower to access and process.)



Source: https://stackoverflow.com/questions/7995698/what-algorithm-to-use-to-delete-duplicates
