I have a folder in HDFS which has two subfolders; each one has about 30 subfolders, each of which, finally, contains XML files. I want to list all the XML files given only the main folder's path.
Now, one can use Spark to do the same, and it's much faster than other approaches (such as Hadoop MR). Here is the code snippet:
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]): Unit = {
  val fs = FileSystem.get(sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(filePath)).foreach { fileStatus =>
    if (!fileStatus.isDirectory && fileStatus.getPath.getName.endsWith(".xml")) {
      // Collect the full path of every XML file found
      filePaths += fileStatus.getPath.toString
    } else if (fileStatus.isDirectory && recursiveTraverse) {
      // Descend into subdirectories only when recursive traversal is requested
      traverseDirectory(fileStatus.getPath.toString, recursiveTraverse, filePaths)
    }
  }
}
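
For completeness, here is a minimal usage sketch. It assumes a SparkContext named sparkContext is already in scope (as the snippet above does), and the HDFS path shown is hypothetical; SparkContext.textFile accepts a comma-separated list of paths, so the collected files can be read in parallel afterwards.

// Hypothetical root folder; replace with your main folder's path
val xmlPaths = ListBuffer[String]()
traverseDirectory("hdfs:///user/me/main-folder", recursiveTraverse = true, xmlPaths)

// Read all discovered XML files in parallel as one RDD of lines
val xmlLines = sparkContext.textFile(xmlPaths.mkString(","))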