As part of a project I\'m working on, I\'d like to clean up a file I generate of duplicate line entries. These duplicates often won\'t occur near each other, however. I came
The Hash Set approach is OK, but you can tweak it to not have to store all the Strings in memory, but a logical pointer to the location in the file so you can go back to read the actual value only in case you need it.
Another creative approach is to append to each line the number of the line, then sort all the lines, remove the duplicates (ignoring the last token that should be the number), and then sort again the file by the last token and striping it out in the output.