Question
I am really stumped by this problem, and as a result I have stopped working for a while. I work with really large pieces of data. I get approximately 200 GB of .txt data every week. The data can range up to 500 million lines. A lot of these are duplicates; I would guess only 20 GB is unique. I have had several custom programs made, including hash-based duplicate removal and external (disk-based) duplicate removal, but none seem to work. The latest one used a temporary database but took several days to remove the data.
The problem with all the programs is that they crash after a certain point. After spending a large amount of money on them, I thought I would come online and see if anyone can help. I understand this has been answered on here before, and I have spent the last 3 hours reading about 50 threads, but none seem to have the same problem as me, i.e. huge datasets.
Can anyone recommend anything? It needs to be super accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.
Answer 1:
The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that sorted 50 to 100 gigabytes on a 16 GB machine.
By the way, you might be able to do this with an off-the-shelf program. The GNU sort utility can certainly sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just run: sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. paid) program that will do the sorting. Developing a program that sorts a huge text file efficiently is a non-trivial task. If you can buy something for a few hundred dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
fileNumber = 0
while not end-of-input:
    load as many lines as you can into a list
    sort the list
    filename = "file" + fileNumber
    write sorted list to filename, optionally removing duplicates
    fileNumber = fileNumber + 1
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
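A minimal Python sketch of that chunking pass, assuming it fits the pseudocode above (the chunk size and the chunk_{n}.txt naming are illustrative choices, not part of the original answer):

```python
import itertools

def write_sorted_chunks(input_path, chunk_lines=20_000_000):
    """Split the input into sorted, deduplicated temporary files.

    Returns the list of temporary file names, in creation order.
    chunk_lines should be tuned to available memory."""
    chunk_files = []
    with open(input_path, "r", encoding="utf-8") as infile:
        for file_number in itertools.count():
            # Read the next chunk of up to chunk_lines lines.
            lines = list(itertools.islice(infile, chunk_lines))
            if not lines:
                break
            lines.sort()
            name = f"chunk_{file_number}.txt"
            with open(name, "w", encoding="utf-8") as out:
                # After sorting, duplicates are adjacent, so a single
                # linear pass comparing against the previous line removes them.
                previous = None
                for line in lines:
                    if line != previous:
                        out.write(line)
                        previous = line
            chunk_files.append(name)
    return chunk_files
```

Writing each chunk already deduplicated is what makes the later merge pass cheap: the temporary files together hold roughly the unique data, not the full input.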
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
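The merge step described above can be sketched with Python's heapq module (a hedged sketch; the function name and file handling are my own, and it assumes the chunk files are already sorted):

```python
import heapq

def merge_unique(chunk_paths, output_path):
    """k-way merge of sorted text files, writing each distinct line once.

    The heap holds (line, file_index) pairs; the index both identifies
    which file to refill from and breaks ties between equal lines."""
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    heap = []
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, i))
    previous = None
    with open(output_path, "w", encoding="utf-8") as out:
        while heap:
            line, i = heapq.heappop(heap)
            if line != previous:  # skip duplicates, including across files
                out.write(line)
                previous = line
            # Refill the queue from the file the popped line came from.
            next_line = files[i].readline()
            if next_line:
                heapq.heappush(heap, (next_line, i))
    for f in files:
        f.close()
```

The standard library's heapq.merge would do the k-way merge itself; the explicit priority queue here just mirrors the steps in the answer, with the previous-line check providing the deduplication.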
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed the program is operate on smaller chunks and be sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
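One way to sketch that overlap in Python is a single background worker that sorts and writes the previous chunk while the main thread reads the next one (names and chunk size are illustrative; note that in CPython the GIL limits how much of the sort itself runs in parallel, but the disk reads release it, and the I/O wait is where most of the time goes):

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def sort_and_write(lines, name):
    """Worker: sort one chunk, write it deduplicated, return the file name."""
    lines.sort()
    with open(name, "w", encoding="utf-8") as out:
        previous = None
        for line in lines:
            if line != previous:
                out.write(line)
                previous = line
    return name

def chunked_sort_overlapped(input_path, chunk_lines):
    """Read the next chunk while a worker thread sorts the previous one."""
    names = []
    with ThreadPoolExecutor(max_workers=1) as pool, \
         open(input_path, "r", encoding="utf-8") as infile:
        pending = None
        for n in itertools.count():
            lines = list(itertools.islice(infile, chunk_lines))
            if not lines:
                break
            if pending is not None:
                # Wait for the previous chunk before submitting the next,
                # so at most two chunks are in memory at once.
                names.append(pending.result())
            pending = pool.submit(sort_and_write, lines, f"chunk_{n}.txt")
        if pending is not None:
            names.append(pending.result())
    return names
```

Capping the pool at one worker keeps memory bounded at roughly two chunks: the one being read and the one being sorted.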
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
Answer 2:
I am no specialist in such algorithms, but if it is textual data (or numbers, it doesn't matter), you can try to read your big file and write it into several files keyed by the first two or three characters: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" go to aab.txt, etc. You'll get lots of files within which the data are in an equivalence relation: any duplicate of a line is in the same file as the line itself. Now just deduplicate each file in memory and you're done. Again, I'm not sure that it will work, but I'd try this approach first...
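A sketch of this prefix-partitioning idea (the prefix length, bucket directory, and hex-encoded file naming are my assumptions; real data may also hit the OS limit on simultaneously open files if the prefixes are diverse):

```python
import os

def dedup_by_prefix(input_path, output_path, prefix_len=3, workdir="buckets"):
    """Partition lines into bucket files by their leading characters,
    then deduplicate each bucket independently in memory."""
    os.makedirs(workdir, exist_ok=True)
    buckets = {}  # bucket file name -> open file handle
    with open(input_path, "r", encoding="utf-8") as infile:
        for line in infile:
            key = line[:prefix_len]
            # Hex-encode the prefix so it is always a safe file name,
            # even if it contains slashes, newlines, or non-ASCII text.
            name = os.path.join(workdir, key.encode("utf-8").hex() + ".txt")
            f = buckets.get(name)
            if f is None:
                f = buckets[name] = open(name, "w", encoding="utf-8")
            f.write(line)
    for f in buckets.values():
        f.close()
    # Every duplicate shares its line's prefix, so deduplicating each
    # bucket on its own is enough; no cross-bucket comparison is needed.
    with open(output_path, "w", encoding="utf-8") as out:
        for name in sorted(buckets):
            with open(name, "r", encoding="utf-8") as f:
                for line in sorted(set(f)):
                    out.write(line)
```

Unlike the sort-and-merge approach in the first answer, this only works if every individual bucket fits in memory, which depends on how evenly the prefixes distribute the data.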
Source: https://stackoverflow.com/questions/22623472/remove-all-duplicates-in-a-large-text-file