Deleting duplicate lines in a file using Java


As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came

14 answers

挽巷

2020-12-14 02:13

There are two scalable solutions, where by scalable I mean disk-based rather than memory-based. Which one to use depends on whether the procedure should be stable, meaning that the lines left after removing duplicates keep their original order. If scalability isn't an issue, simply apply the same kind of method in memory.

For the non-stable solution, first sort the file on disk. This is done by splitting the file into smaller chunks, sorting each chunk in memory, and then merging the chunk files in sorted order, with the merge skipping duplicates.

The merge itself needs almost no memory: it compares only the current line of each chunk file, since the next line in a sorted chunk is guaranteed to be greater than or equal to the current one.
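A minimal sketch of that duplicate-skipping merge, under a few assumptions that are not part of the answer: each sorted chunk is exposed as an `Iterator<String>` (for a chunk file, something like `Files.lines(path).iterator()`), and the class and method names `DedupMerge` and `mergeUnique` are made up for illustration.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

class DedupMerge {
    // The current (smallest unread) line of one sorted chunk, plus the rest of that chunk.
    private static final class Head {
        final String line;
        final Iterator<String> rest;

        Head(String line, Iterator<String> rest) {
            this.line = line;
            this.rest = rest;
        }
    }

    /** Merges already-sorted chunks and passes each distinct line to the sink exactly once. */
    static void mergeUnique(List<Iterator<String>> sortedChunks, Consumer<String> sink) {
        // Only one line per chunk is held in memory at any time.
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Iterator<String> chunk : sortedChunks) {
            if (chunk.hasNext()) {
                heads.add(new Head(chunk.next(), chunk));
            }
        }
        String lastWritten = null;
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();
            // Output is produced in sorted order, so a duplicate always equals the last line written.
            if (!smallest.line.equals(lastWritten)) {
                sink.accept(smallest.line);
                lastWritten = smallest.line;
            }
            if (smallest.rest.hasNext()) {
                heads.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
    }
}
```

When writing to a file, the sink could be something like `writer::println` on a `PrintWriter`. The priority queue is one possible way to pick the smallest current line; with only a handful of chunks a linear scan would do as well.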

The stable solution is slightly trickier. First, sort the file in chunks as before, but tag each line with its original line number. Then, during the "merge", don't bother storing the result; store only the line numbers to be deleted.

Then copy the original file line by line, skipping the line numbers you stored above.
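To make the stable variant concrete, here is an in-memory sketch of the same two passes. It assumes the lines fit in a `List<String>`, which a disk-based version would replace with chunk files, and the names `StableDedup`, `duplicateLineNumbers` and `copyWithout` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StableDedup {
    // A line tagged with its 0-based position in the original file.
    private static final class Numbered {
        final int number;
        final String text;

        Numbered(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    /** Pass 1: returns the line numbers of every repeat, keeping the first occurrence. */
    static Set<Integer> duplicateLineNumbers(List<String> lines) {
        List<Numbered> tagged = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            tagged.add(new Numbered(i, lines.get(i)));
        }
        // Sort by text, then by line number, so the earliest copy leads each run of equal lines.
        tagged.sort(Comparator.comparing((Numbered n) -> n.text)
                .thenComparingInt(n -> n.number));
        Set<Integer> toDelete = new HashSet<>();
        for (int i = 1; i < tagged.size(); i++) {
            if (tagged.get(i).text.equals(tagged.get(i - 1).text)) {
                toDelete.add(tagged.get(i).number);
            }
        }
        return toDelete;
    }

    /** Pass 2: copies the lines in their original order, skipping the stored line numbers. */
    static List<String> copyWithout(List<String> lines, Set<Integer> toDelete) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!toDelete.contains(i)) {
                kept.add(lines.get(i));
            }
        }
        return kept;
    }
}
```

Only the set of line numbers has to survive between the two passes, which is why the merged result itself never needs to be stored.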

天命终不由人

2020-12-14 02:14

Does the order of the lines matter, and how many duplicates do you expect to see?

If order doesn't matter and you expect a lot of duplicates (i.e. far more reading than writing), I'd also think about parallelizing the HashSet solution, with the HashSet as a shared resource.
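One way this could look, as a rough sketch rather than a tuned implementation: a concurrent set from `ConcurrentHashMap.newKeySet()` stands in for the shared HashSet, and a parallel stream does the work. The class name `ParallelDedup` is made up, and the lines are assumed to already be in a `List<String>`.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

class ParallelDedup {
    /** Returns each distinct line once; which copy of a duplicate survives is not deterministic. */
    static List<String> uniqueLines(List<String> lines) {
        // Thread-safe set shared by all worker threads.
        Set<String> seen = ConcurrentHashMap.newKeySet();
        return lines.parallelStream()
                .filter(seen::add) // add returns true only for the first thread to insert a given line
                .collect(Collectors.toList());
    }
}
```

Because any copy of a repeated line may win the race, the position of a line in the output can differ between runs, so this only fits the case where order doesn't matter.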

借酒劲吻你

2020-12-14 02:16

Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates, they are simply ignored (as a Set can contain each element only once).
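A short sketch of that loop, assuming the input and output are passed in as a `Reader` and a `Writer`; the names `StreamingDedup` and `copyUnique` are only for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.HashSet;
import java.util.Set;

class StreamingDedup {
    /** Copies source to target line by line, writing each distinct line only the first time it appears. */
    static void copyUnique(Reader source, Writer target) throws IOException {
        Set<String> seen = new HashSet<>();
        BufferedReader in = new BufferedReader(source);
        BufferedWriter out = new BufferedWriter(target);
        for (String line; (line = in.readLine()) != null; ) {
            if (seen.add(line)) { // add returns false when the line was already in the set
                out.write(line);
                out.newLine();
            }
        }
        out.flush();
    }
}
```

For real files the arguments could come from `Files.newBufferedReader(path)` and `Files.newBufferedWriter(path)`. Since each line is written as soon as it is first seen, the output keeps the original order even with a plain HashSet; memory use grows with the number of distinct lines.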

星月不相逢

2020-12-14 02:19

Something like this, perhaps:

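```
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.util.LinkedHashSet;
import java.util.Set;

BufferedReader in = ...;
// LinkedHashSet drops duplicates and remembers insertion order
Set<String> lines = new LinkedHashSet<>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);
```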

LinkedHashSet keeps the insertion order, as opposed to HashSet, which (while being slightly faster for lookup/insert) makes no guarantee about iteration order, so the lines may come out reordered.

一生所求

2020-12-14 02:19
- Read in the file, storing the line number and the line: O(n)
- Sort it into alphabetical order: O(n log n)
- Remove duplicates: O(n)
- Sort it into its original line number order: O(n log n)
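
The four steps above could be sketched as follows, assuming the file's lines are already loaded into a `List<String>`; the names `SortDedup` and `dedupe` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortDedup {
    // A line together with its 0-based line number in the input.
    private static final class Numbered {
        final int number;
        final String text;

        Numbered(int number, String text) {
            this.number = number;
            this.text = text;
        }
    }

    static List<String> dedupe(List<String> lines) {
        // Step 1: store the line number with each line, O(n).
        List<Numbered> tagged = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            tagged.add(new Numbered(i, lines.get(i)));
        }
        // Step 2: sort alphabetically, O(n log n). List.sort is stable,
        // so equal lines stay in line-number order.
        tagged.sort(Comparator.comparing((Numbered n) -> n.text));
        // Step 3: keep only the first entry of each run of equal lines, O(n).
        List<Numbered> unique = new ArrayList<>();
        for (Numbered n : tagged) {
            if (unique.isEmpty() || !unique.get(unique.size() - 1).text.equals(n.text)) {
                unique.add(n);
            }
        }
        // Step 4: sort back into original line-number order, O(n log n).
        unique.sort(Comparator.comparingInt((Numbered n) -> n.number));
        List<String> result = new ArrayList<>();
        for (Numbered n : unique) {
            result.add(n.text);
        }
        return result;
    }
}
```

The overall cost is dominated by the two sorts, so the whole procedure is O(n log n).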

情话喂你

2020-12-14 02:20
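If you can use UNIX shell commands, you could do something like the following (pseudocode, not a runnable script):
```
for(i = line 0 to end)
{
    sed 's/\$i//2g'   # deletes all repeats of line i after its first occurrence
}
```
The idea is to walk through the file once and, for each line, remove all of its later repeats in a single sed call, so you never repeat a search you've already done. Note that sed applies `s` to one line at a time by default, so this is a sketch of the idea rather than a working command.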