Remove duplicate rows from a CSV file without writing a new file

Question

This is my code for now:

File file1 = new File("file1.csv");
File file2 = new File("file2.csv");
HashSet<String> f1 = new HashSet<>(FileUtils.readLines(file1));
HashSet<String> f2 = new HashSet<>(FileUtils.readLines(file2));
f2.removeAll(f1);

With removeAll() I remove from the lines of file2 every line that also appears in file1, but now I want to avoid creating a new CSV file, to optimize the process. I just want to delete the duplicate rows from file2.

Is this possible, or do I have to create a new file?

Answer 1:

"now I want to avoid creating a new CSV file, to optimize the process."

Well, sure, you can do that... If you don't mind possibly losing the file!

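DON'T DO THAT. If the write fails partway through (a crash, a full disk), the original content of the file is already gone.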

And since you use Java 7, use java.nio.file: write the filtered lines to a temporary file, then move that file over the original. Here's an example:

final Path file1 = Paths.get("file1.csv");
final Path file2 = Paths.get("file2.csv");
// Temporary file in the same directory as file2
final Path tmpfile = file2.resolveSibling("file2.csv.new");

// Lines of file1, used as the lookup set
final Set<String> file1Lines
    = new HashSet<>(Files.readAllLines(file1, StandardCharsets.UTF_8));

// Copy to the temporary file only the lines of file2 that are not in file1.
// CREATE_NEW makes this fail if the temporary file already exists.
try (
    final BufferedReader reader = Files.newBufferedReader(file2,
        StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    String line;
    while ((line = reader.readLine()) != null)
        if (!file1Lines.contains(line)) {
            writer.write(line);
            writer.newLine();
        }
}

// Replace file2 with the filtered copy; fall back to a plain move
// if the file system does not support an atomic one.
try {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING,
        StandardCopyOption.ATOMIC_MOVE);
} catch (AtomicMoveNotSupportedException ignored) {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING);
}
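
For reference, the steps above can be wrapped into a self-contained helper with its imports. This is only a sketch under a few assumptions: the class name CsvDedup and the method name removeLinesFoundIn are made up for illustration, and it uses Files.createTempFile in the target's directory instead of the fixed file2.csv.new name, so that a leftover temporary file from an earlier failed run does not block the next one.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the names are illustrative and not part of any library.
final class CsvDedup {

    // Rewrites `target` so that it keeps only the lines not present in `reference`.
    static void removeLinesFoundIn(Path reference, Path target) throws IOException {
        Set<String> referenceLines =
            new HashSet<>(Files.readAllLines(reference, StandardCharsets.UTF_8));

        // Create the temporary file next to the target so the final move
        // stays on the same file system.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "dedup", ".tmp");
        try {
            try (BufferedReader reader =
                     Files.newBufferedReader(target, StandardCharsets.UTF_8);
                 BufferedWriter writer =
                     Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!referenceLines.contains(line)) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
            }
            try {
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // Does nothing after a successful move; removes the temporary
            // file if something failed before the move completed.
            Files.deleteIfExists(tmp);
        }
    }
}
```

It would be called as CsvDedup.removeLinesFoundIn(Paths.get("file1.csv"), Paths.get("file2.csv")). One trade-off to be aware of: the replaced file takes on the permissions of the temporary file, which Files.createTempFile may create with more restrictive access than the original had.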

If you use Java 8, you can use this try-with-resources block instead:

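try (
    final Stream<String> stream = Files.lines(file2, StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    stream.filter(line -> !file1Lines.contains(line))
        .forEach(line -> {
            // A Consumer lambda cannot throw the checked IOException,
            // so rethrow it as java.io.UncheckedIOException.
            try {
                writer.write(line);
                writer.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
}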
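
A lambda passed to forEach cannot throw the checked IOException declared by BufferedWriter.write, so the stream version needs some care. One possible way around it, shown here as a sketch rather than as the answer's own method, is to hand the filtered stream to Files.write as an Iterable, which does the writing itself. The class name StreamFilterSketch and the method name writeFiltered are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper; the names are illustrative and not part of any library.
final class StreamFilterSketch {

    // Writes to `tmpfile` every line of `source` that is not in `excluded`.
    static void writeFiltered(Path source, Path tmpfile, Set<String> excluded)
            throws IOException {
        try (Stream<String> lines = Files.lines(source, StandardCharsets.UTF_8)) {
            Stream<String> kept = lines.filter(line -> !excluded.contains(line));
            // Iterable has a single abstract method, so the method reference
            // adapts the stream to it. The result can be iterated only once,
            // which is all Files.write needs.
            Files.write(tmpfile, (Iterable<String>) kept::iterator,
                StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
        }
    }
}
```

The temporary file would then be moved over file2 with Files.move exactly as in the Java 7 example. Read errors that occur while the stream is being consumed surface as UncheckedIOException rather than IOException.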

Answer 2:

I solved it with this line of code:

FileUtils.writeLines(file2, f2);

It overwrites the file in place and can be a good solution for small to medium files, but for very large datasets I honestly don't know.
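
One caveat with this approach: the f2 set built in the question is a HashSet, which has no defined iteration order, so the rows (including any header row) may be written back in a different order than they were read. A minimal sketch that keeps the original order by using a LinkedHashSet follows. It uses plain java.nio.file calls instead of Commons IO, and the class and method names are made up for illustration. Like the one-liner above, it overwrites file2 in place, and because it stores the rows in a set, it also collapses rows repeated inside file2 itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; the names are illustrative and not part of any library.
final class OrderedOverwriteSketch {

    // Removes from `file2` every line found in `file1`, keeping the
    // remaining lines in their original order.
    static void overwriteWithoutDuplicates(Path file1, Path file2) throws IOException {
        Set<String> f1 = new HashSet<>(Files.readAllLines(file1, StandardCharsets.UTF_8));
        // LinkedHashSet iterates in insertion order, i.e. file order.
        Set<String> f2 = new LinkedHashSet<>(Files.readAllLines(file2, StandardCharsets.UTF_8));
        f2.removeAll(f1);
        // With no options given, Files.write creates or truncates the target.
        Files.write(file2, f2, StandardCharsets.UTF_8);
    }
}
```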

Source: https://stackoverflow.com/questions/27875560/remove-duplicate-rows-from-csv-file-without-write-a-new-file

Tags

java

csv

hashset