As part of a project I'm working on, I'd like to clean up a file I generate by removing duplicate line entries. These duplicates often won't occur near each other, however. I came
Does it matter in which order the lines come, and how many duplicates are you counting on seeing?
If not, and if you're counting on a lot of dupes (i.e., a lot more reading than writing), I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.
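Here's a minimal sketch of what I mean, assuming Java (since you mention a hashset) and hypothetical `input.txt`/`output.txt` paths. A `ConcurrentHashMap`-backed key set serves as the shared resource, and `Set.add` doubles as the filter predicate, since it returns `true` only for the first thread to insert a given line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class DedupeLines {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths; substitute your own.
        Path in = Path.of("input.txt");
        Path out = Path.of("output.txt");

        // Shared, thread-safe set of lines seen so far.
        Set<String> seen = ConcurrentHashMap.newKeySet();

        List<String> lines = Files.readAllLines(in);

        // Set.add returns true only for the first occurrence of a
        // line, so duplicates are dropped wherever they appear.
        // With a parallel stream, *which* copy of a duplicate
        // survives is racy, so the position of a kept line in the
        // output can vary between runs.
        List<String> unique = lines.parallelStream()
                .filter(seen::add)
                .collect(Collectors.toList());

        Files.write(out, unique);
    }
}
```

That's why the order question matters: if you need each line to stay at the position of its first occurrence, a single sequential pass over a `LinkedHashSet` is the simpler and safer choice, and the parallel version only pays off when the dedup check dominates the I/O.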