I have to deal with a directory of about 2 million XML files that need to be processed.
I've already solved the processing itself, distributing the work between machines and threads.
If you can use Java 7, this can be done in the following way, and you won't have those out-of-memory problems.
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

Path path = FileSystems.getDefault().getPath("C:\\path\\with\\lots\\of\\files");

Files.walkFileTree(path, new FileVisitor<Path>() {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        // The files are handed to you one at a time; process each one here.
        System.out.println(file);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
        // Stop the walk on the first unreadable file; return CONTINUE to skip it instead.
        return FileVisitResult.TERMINATE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        return FileVisitResult.CONTINUE;
    }
});
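As a side note, if all the files sit in one flat directory, Java 7's Files.newDirectoryStream also reads entries lazily instead of building the whole listing in memory. Here is a minimal sketch, assuming the same path as above and that you only want the *.xml entries:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path dir = Paths.get("C:\\path\\with\\lots\\of\\files");
// Entries are streamed lazily, so the full directory listing is never held in memory.
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.xml")) {
    for (Path file : stream) {
        System.out.println(file); // process each XML file here
    }
}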
You can do that with the Apache Commons IO FileUtils library. No memory problem; I checked it with VisualVM.
import java.io.File;
import java.util.Iterator;
import org.apache.commons.io.FileUtils;

// null = no extension filter (all files), true = recurse into subdirectories
Iterator<File> it = FileUtils.iterateFiles(folder, null, true);
while (it.hasNext()) {
    File fileEntry = it.next(); // the iterator is already typed as File, no cast needed
    // process fileEntry here
}
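Since your files are XML, you could also filter by extension directly; a small sketch (the "xml" filter is my assumption, not part of the original snippet):

// Only iterate over files with the .xml extension, recursively.
Iterator<File> xmlFiles = FileUtils.iterateFiles(folder, new String[] { "xml" }, true);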
Hope that helps. Bye.
Why do you store 2 million files in the same directory anyway? I can imagine it already slows down access terribly at the OS level.
I would definitely want to have them divided into subdirectories (e.g. by date/time of creation) before processing. But if that is not possible for some reason, could it be done during processing? E.g. move 1000 files queued for Process1 into Directory1, another 1000 files for Process2 into Directory2, and so on. Then each process/thread sees only the (limited number of) files portioned out to it; see the sketch below.
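A minimal sketch of that batching idea in Java 7, assuming a flat source directory of XML files; the names workerCount, batchSize and the "DirectoryN" naming are hypothetical, not from the original post:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

Path source = Paths.get("C:\\path\\with\\lots\\of\\files"); // the flat directory with the XML files
int workerCount = 8;   // hypothetical number of processes/threads
int batchSize = 1000;  // files per subdirectory, as suggested above
int moved = 0;

try (DirectoryStream<Path> stream = Files.newDirectoryStream(source, "*.xml")) {
    for (Path file : stream) {
        // Blocks of batchSize files go to Directory1, Directory2, ... in round-robin order.
        int worker = (moved / batchSize) % workerCount + 1;
        Path targetDir = source.resolve("Directory" + worker);
        Files.createDirectories(targetDir);
        Files.move(file, targetDir.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
        moved++;
    }
}

Each process/thread would then open only its own DirectoryN, so it never has to enumerate the full 2 million entries.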