Merge huge files without loading whole file into memory?

Submitted by 社会主义新天地 on 2019-11-30 13:36:53

Question


I want to merge huge files containing strings into one file and tried to use NIO2. I do not want to load a whole file into memory, so I tried it with BufferedReader:

public void mergeFiles(List<Path> filesToBeMerged) throws IOException {

    Path mergedFile = Paths.get("mergedFile");
    Files.createFile(mergedFile);

    List<Path> _filesToBeMerged = filesToBeMerged;

    try (BufferedWriter writer = Files.newBufferedWriter(mergedFile, StandardOpenOption.APPEND)) {
        for (Path file : _filesToBeMerged) {
            // this does not work, as write()/append() does not accept a BufferedReader
            writer.append(Files.newBufferedReader(file));
        }
    } catch (IOException e) {
        System.err.println(e);
    }

}

I then tried the following. It works; however, the formatting of the strings (e.g. line breaks) is not copied to the merged file:

...
try (BufferedWriter writer = Files.newBufferedWriter(mergedFile, StandardOpenOption.APPEND)) {
    for (Path file : _filesToBeMerged) {
        // writer.write(Files.newBufferedReader(file));
        String line = null;

        BufferedReader reader = Files.newBufferedReader(file);
        while ((line = reader.readLine()) != null) {
            writer.append(line);
            writer.append(System.lineSeparator());
        }
        reader.close();
    }
} catch (IOException e) {
    System.err.println(e);
}
...

How can I merge huge files with NIO2 without loading a whole file into memory?


Answer 1:


If you want to merge two or more files efficiently, you should ask yourself why on earth you are using the char-based Reader and Writer to perform that task.

By using these classes you decode the file's bytes into characters (Java's internal Unicode representation) and then encode those characters back into bytes. Note that Files.newBufferedReader and Files.newBufferedWriter use UTF-8 unless you pass a charset, not the system's default encoding. Either way, the program has to perform two data conversions on the entire content of the files.

And, by the way, BufferedReader and BufferedWriter are by no means NIO2 artifacts. These classes have existed since JDK 1.1.

When you copy byte-wise via real NIO functions, the files can be transferred without being touched by the Java application. In the best case, the transfer is performed directly in the file system's buffer:

import static java.nio.file.StandardOpenOption.*;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MergeFiles
{
  public static void main(String[] arg) throws IOException {
    if(arg.length<2) {
      System.err.println("Syntax: infiles... outfile");
      System.exit(1);
    }
    Path outFile=Paths.get(arg[arg.length-1]);
    System.out.println("TO "+outFile);
    try(FileChannel out=FileChannel.open(outFile, CREATE, WRITE)) {
      for(int ix=0, n=arg.length-1; ix<n; ix++) {
        Path inFile=Paths.get(arg[ix]);
        System.out.println(inFile+"...");
        try(FileChannel in=FileChannel.open(inFile, READ)) {
          for(long p=0, l=in.size(); p<l; )
            p+=in.transferTo(p, l-p, out);
        }
      }
    }
    System.out.println("DONE.");
  }
}
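
To connect this with the question's method signature, here is a minimal sketch of the same transferTo loop wrapped in a reusable method. The class name ChannelMerge, the method name mergeFiles and the use of TRUNCATE_EXISTING are assumptions made for illustration and are not part of the answer above; TRUNCATE_EXISTING is added so that an already existing, longer output file does not keep stale bytes at its end.

```java
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.READ;
import static java.nio.file.StandardOpenOption.TRUNCATE_EXISTING;
import static java.nio.file.StandardOpenOption.WRITE;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.util.List;

final class ChannelMerge {
  // Hypothetical helper: concatenates the raw bytes of all sources into target, in list order.
  static void mergeFiles(List<Path> sources, Path target) throws IOException {
    // TRUNCATE_EXISTING (an assumption of this sketch) discards old content of an existing target.
    try (FileChannel out = FileChannel.open(target, CREATE, WRITE, TRUNCATE_EXISTING)) {
      for (Path source : sources) {
        try (FileChannel in = FileChannel.open(source, READ)) {
          long size = in.size();
          long position = 0;
          // transferTo may move fewer bytes than requested, so repeat until the file is done.
          while (position < size) {
            position += in.transferTo(position, size - position, out);
          }
        }
      }
    }
  }
}
```

A call such as ChannelMerge.mergeFiles(filesToBeMerged, Paths.get("mergedFile")) would then replace the body of the question's method. Because the bytes are copied unchanged, line endings inside each file are preserved exactly as they are.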



Answer 2:


With

Files.newBufferedReader(file).readLine()

you create a new reader every time, so reading always starts again at the first line.

Replace with

BufferedReader reader = Files.newBufferedReader(file);
while ((line = reader.readLine()) != null) {
  writer.write(line);
}

and .close() the reader when done.
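
If you stay with the line-based approach, try-with-resources can take care of closing the reader even when an exception is thrown. The following is only a sketch under a few assumptions: the names LineMerge and mergeLines are made up for illustration, the charset is passed in explicitly, and every line is terminated with the platform line separator via newLine(), so the original line endings are normalized rather than preserved.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LineMerge {
  // Hypothetical helper: copies all sources line by line into target.
  static void mergeLines(List<Path> sources, Path target, Charset charset) throws IOException {
    try (BufferedWriter writer = Files.newBufferedWriter(target, charset)) {
      for (Path source : sources) {
        // One reader per file; it is closed automatically at the end of this block.
        try (BufferedReader reader = Files.newBufferedReader(source, charset)) {
          String line;
          while ((line = reader.readLine()) != null) {
            writer.write(line);
            writer.newLine(); // readLine() strips the terminator, so write one back
          }
        }
      }
    }
  }
}
```

Only one line per file is held in memory at a time, but the content is still decoded and re-encoded on the way through.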




Answer 3:


readLine() does not yield the line ending ("\n" or "\r\n"). That was the error.

while ((line = reader.readLine()) != null) {
    writer.write(line);
    writer.write("\r\n"); // Windows
}

You could also skip this line-by-line handling of (possibly different) line endings altogether and use

try (OutputStream out = new FileOutputStream(file)) { // file: the merged output file
    for (Path source : filesToBeMerged) {
        Files.copy(source, out);
        out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
    }
}

This writes a line break explicitly after every file, which covers the case where a file's last line does not end with one.
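
If you would rather not add an extra empty line after files that already end with a line break, you can look at the last byte of each file first. This is a rough sketch, not a definitive implementation: the names NewlineAwareMerge, endsWithLineBreak and merge are hypothetical, and the check assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (it would not be reliable for UTF-16).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class NewlineAwareMerge {
  // Hypothetical helper: true if the file is empty or its last byte is '\n' or '\r'.
  static boolean endsWithLineBreak(Path file) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file)) {
      long size = channel.size();
      if (size == 0) {
        return true; // nothing to terminate
      }
      channel.position(size - 1);
      ByteBuffer last = ByteBuffer.allocate(1);
      while (last.hasRemaining()) {
        if (channel.read(last) < 0) {
          return false; // file shrank while reading; treat as unterminated
        }
      }
      byte b = last.get(0);
      return b == '\n' || b == '\r';
    }
  }

  // Hypothetical helper: byte-wise merge that adds "\n" only where a file lacks a final line break.
  static void merge(List<Path> sources, Path target) throws IOException {
    try (OutputStream out = Files.newOutputStream(target)) {
      for (Path source : sources) {
        Files.copy(source, out);
        if (!endsWithLineBreak(source)) {
          out.write('\n');
        }
      }
    }
  }
}
```

Only the final byte of each file is read for the check, so memory use stays constant regardless of file size.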

There might still be a problem with the optional, ugly Unicode BOM character to mark the text as UTF-8/UTF-16LE/UTF-16BE at the beginning of the file.
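
For the UTF-8 case, one possible way to handle this is to inspect the first three bytes of each source and drop them if they form the BOM (EF BB BF). The sketch below is an illustration only: BomSkippingCopy and copyWithoutUtf8Bom are made-up names, and UTF-16 BOMs are deliberately not handled, since mixing UTF-16 files into a byte-wise merge needs a real conversion anyway.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class BomSkippingCopy {
  // Hypothetical helper: copies source to out, dropping a leading UTF-8 BOM if present.
  static void copyWithoutUtf8Bom(Path source, OutputStream out) throws IOException {
    try (InputStream in = Files.newInputStream(source)) {
      // Read up to three bytes; a single read() call may return fewer than requested.
      byte[] head = new byte[3];
      int n = 0;
      while (n < head.length) {
        int r = in.read(head, n, head.length - n);
        if (r < 0) {
          break;
        }
        n += r;
      }
      boolean hasBom = n == 3
          && (head[0] & 0xFF) == 0xEF
          && (head[1] & 0xFF) == 0xBB
          && (head[2] & 0xFF) == 0xBF;
      if (!hasBom) {
        out.write(head, 0, n); // ordinary content, pass it through
      }
      // Copy the rest in fixed-size chunks, so memory use does not grow with file size.
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    }
  }
}
```

In the merge loop above, a call to this method would take the place of Files.copy(source, out). Whether the BOM of the very first file should be kept depends on what the consumers of the merged file expect.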



Source: https://stackoverflow.com/questions/25546750/merge-huge-files-without-loading-whole-file-into-memory
