I have a folder in HDFS which has two subfolders; each of those has about 30 subfolders which, finally, each contain XML files. I want to list all the XML files, giving only the main folder's path.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @param filePath the root path to scan
 * @param fs the FileSystem handle
 * @return list of absolute file paths present in the given path
 * @throws FileNotFoundException
 * @throws IOException
 */
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
    List<String> fileList = new ArrayList<String>();
    FileStatus[] fileStatus = fs.listStatus(filePath);
    for (FileStatus fileStat : fileStatus) {
        if (fileStat.isDirectory()) {
            // recurse into subdirectories
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
        } else {
            fileList.add(fileStat.getPath().toString());
        }
    }
    return fileList;
}
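For reference, here is a minimal way to call this helper (a sketch; the path is a placeholder for your main folder):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// prints every file found under the main folder, at any depth
for (String file : getAllFilePath(new Path("/main/folder"), fs)) {
    System.out.println(file);
}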
Quick example: suppose you have the following file structure:

a -> b
  -> c -> d
       -> e
  -> d -> f
Using the code above, you get:
a/b
a/c/d
a/c/e
a/d/f
If you want only the leaf names (i.e. the file names), use the following code in the else block:

...
} else {
    String fileName = fileStat.getPath().toString();
    fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
}
This will give:
b
d
e
f
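Since the question only needs the XML files, the same else block can also filter on the extension; a small variation (a sketch, using Path.getName(), which returns just the last path component):

} else if (fileStat.getPath().getName().endsWith(".xml")) {
    // getName() already strips the parent directories, so no substring is needed
    fileList.add(fileStat.getPath().toString());
}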
Here is a code snippet that counts the number of files in a particular HDFS directory (I used this to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.
private int calculateNumberOfReducers(String input) throws IOException {
    int numberOfReducers = 0;
    Path inputPath = new Path(input);
    // getConf() is inherited from Configured when this lives in a Tool implementation
    FileSystem fs = inputPath.getFileSystem(getConf());
    FileStatus[] statuses = fs.globStatus(inputPath);
    for (FileStatus status : statuses) {
        if (status.isDirectory()) {
            numberOfReducers += getNumberOfInputFiles(status, fs);
        } else if (status.isFile()) {
            numberOfReducers++;
        }
    }
    return numberOfReducers;
}
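For context, this is roughly how the count plugs into a driver (hypothetical usage; job here is an org.apache.hadoop.mapreduce.Job, and getConf() comes from extending Configured):

int reducers = calculateNumberOfReducers("/user/etl/input/*");
// guard against an empty input directory
job.setNumReduceTasks(Math.max(1, reducers));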
/**
* Recursively determines number of input files in an HDFS directory
*
* @param status instance of FileStatus
* @param fs instance of FileSystem
* @return number of input files within particular HDFS directory
* @throws IOException
*/
private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException {
    int inputFileCount = 0;
    if (status.isDirectory()) {
        FileStatus[] files = fs.listStatus(status.getPath());
        for (FileStatus file : files) {
            inputFileCount += getNumberOfInputFiles(file, fs);
        }
    } else {
        inputFileCount++;
    }
    return inputFileCount;
}
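Note that since Hadoop 2.x, FileSystem also ships a built-in recursive listing, FileSystem#listFiles(Path, boolean), which avoids the hand-rolled recursion entirely; a sketch:

import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.RemoteIterator;

// the boolean flag asks for a recursive walk of the whole tree
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/main/folder"), true);
while (it.hasNext()) {
    LocatedFileStatus status = it.next();
    if (status.getPath().getName().endsWith(".xml")) {
        System.out.println(status.getPath());
    }
}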
Now, one can use Spark to do the same, and it's way faster than other approaches (such as Hadoop MR). Here is the code snippet.
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.{FileSystem, Path}

def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]): Unit = {
  // sparkContext is the active SparkContext; its Hadoop configuration locates the HDFS cluster
  val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
  files.foreach { fileStatus =>
    if (!fileStatus.isDirectory() && fileStatus.getPath().getName().endsWith(".xml")) {
      filePaths += fileStatus.getPath().toString()
    } else if (fileStatus.isDirectory() && recursiveTraverse) {
      traverseDirectory(fileStatus.getPath().toString(), recursiveTraverse, filePaths)
    }
  }
}
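If the layout is really fixed at two subfolder levels, as in the question, you can skip the recursion entirely with a glob pattern; a minimal sketch using Spark's Java API (the path is a placeholder for your main folder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("list-xml-files");
JavaSparkContext jsc = new JavaSparkContext(conf);
// the glob matches the two known subfolder levels from the question
JavaPairRDD<String, String> xmlFiles = jsc.wholeTextFiles("hdfs:///main/folder/*/*/*.xml");
xmlFiles.keys().collect().forEach(System.out::println);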