Remove duplicate rows from a CSV file without writing a new file

Submitted by 本秂侑毒 on 2019-12-12 02:55:05

Question


This is my code for now:

File file1 = new File("file1.csv");
File file2 = new File("file2.csv");
HashSet<String> f1 = new HashSet<>(FileUtils.readLines(file1));
HashSet<String> f2 = new HashSet<>(FileUtils.readLines(file2));
f2.removeAll(f1);

With removeAll() I remove from f2 every line that also appears in file1, but now I want to avoid creating a new CSV file, to optimize the process. I just want to delete the duplicate rows from file2 itself.

Is this possible, or do I have to create a new file?


Answer 1:


"now I want to avoid creating a new CSV file, to optimize the process."

Well, sure, you can do that... If you don't mind possibly losing the file!

DON'T DO THAT.

And since you use Java 7, well, use java.nio.file. Here's an example:

final Path file1 = Paths.get("file1.csv");
final Path file2 = Paths.get("file2.csv");
final Path tmpfile = file2.resolveSibling("file2.csv.new");

final Set<String> file1Lines 
    = new HashSet<>(Files.readAllLines(file1, StandardCharsets.UTF_8));

try (
    final BufferedReader reader = Files.newBufferedReader(file2,
        StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    String line;
    while ((line = reader.readLine()) != null)
        if (!file1Lines.contains(line)) {
            writer.write(line);
            writer.newLine();
        }
}

try {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING,
        StandardCopyOption.ATOMIC_MOVE);
} catch (AtomicMoveNotSupportedException ignored) {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING);
}

If you use Java 8, you can use this try-with-resources block instead:

try (
    final Stream<String> stream = Files.lines(file2, StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    stream.filter(line -> !file1Lines.contains(line))
        .forEach(line -> {
            // A lambda cannot throw the checked IOException, so wrap it
            try {
                writer.write(line);
                writer.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
}
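
For reference, the Java 7 pieces of this answer can be assembled into one self-contained class. This is only a sketch under assumptions: the class name DedupAgainst and the method name removeLinesFoundIn are made up for illustration, both files are assumed to be UTF-8, rows are compared as whole lines with no CSV-aware parsing, and the demo in main() works on throwaway files in a temporary directory rather than on real data.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DedupAgainst {

    // Hypothetical helper: drops from `target` every line that also occurs in `reference`.
    public static void removeLinesFoundIn(Path reference, Path target) throws IOException {
        // Temp file next to the target, e.g. file2.csv.new (same naming as the answer)
        Path tmp = target.resolveSibling(target.getFileName() + ".new");

        Set<String> referenceLines =
            new HashSet<>(Files.readAllLines(reference, StandardCharsets.UTF_8));

        // Stream the target line by line; only the reference file is held in memory
        try (BufferedReader reader = Files.newBufferedReader(target, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(tmp,
                 StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!referenceLines.contains(line)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }

        // Swap the filtered copy into place; fall back to a plain move if the
        // file system does not support atomic moves
        try {
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Demo on throwaway files in a temp directory (sample rows are invented)
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dedup-demo");
        Path file1 = dir.resolve("file1.csv");
        Path file2 = dir.resolve("file2.csv");
        Files.write(file1, Arrays.asList("a,1", "b,2"), StandardCharsets.UTF_8);
        Files.write(file2, Arrays.asList("b,2", "c,3", "a,1", "d,4"), StandardCharsets.UTF_8);

        removeLinesFoundIn(file1, file2);

        System.out.println(Files.readAllLines(file2, StandardCharsets.UTF_8));
    }
}
```

A temporary file is still created, but the original file2 is only replaced once the filtered copy has been written completely, so a failure partway through leaves file2 untouched.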



Answer 2:


I solved it with this line of code:

FileUtils.writeLines(file2, f2);

It overwrites file2 and can be a good solution for small to medium files, but for very large datasets I honestly don't know.
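
For comparison, the same kind of overwrite can be done with the JDK alone. The following is a rough sketch, not the answerer's code: the class and method names are hypothetical, UTF-8 is assumed, and it filters a list instead of using a HashSet for file2, so the surviving rows keep their original order and rows repeated within file2 itself are left alone. It holds all of file2 in memory, and because it truncates file2 directly, a failure in the middle of the write can leave the file incomplete.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OverwriteInPlace {

    // Hypothetical helper: rewrites `target` without the lines found in `reference`.
    public static void overwriteWithoutDuplicates(Path reference, Path target) throws IOException {
        Set<String> referenceLines =
            new HashSet<>(Files.readAllLines(reference, StandardCharsets.UTF_8));

        // Keep the surviving rows in their original order
        List<String> kept = new ArrayList<>();
        for (String line : Files.readAllLines(target, StandardCharsets.UTF_8)) {
            if (!referenceLines.contains(line)) {
                kept.add(line);
            }
        }

        // With no options given, Files.write uses CREATE, TRUNCATE_EXISTING and WRITE,
        // so the existing file is overwritten directly
        Files.write(target, kept, StandardCharsets.UTF_8);
    }

    // Demo on throwaway files in a temp directory (sample rows are invented)
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("overwrite-demo");
        Path file1 = dir.resolve("file1.csv");
        Path file2 = dir.resolve("file2.csv");
        Files.write(file1, Arrays.asList("a,1", "b,2"), StandardCharsets.UTF_8);
        Files.write(file2, Arrays.asList("b,2", "c,3", "a,1", "d,4"), StandardCharsets.UTF_8);

        overwriteWithoutDuplicates(file1, file2);

        System.out.println(Files.readAllLines(file2, StandardCharsets.UTF_8));
    }
}
```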



Source: https://stackoverflow.com/questions/27875560/remove-duplicate-rows-from-csv-file-without-write-a-new-file
