I have a folder in HDFS which has two subfolders; each of those has about 30 subfolders which, finally, each contain XML files. I want to list all the XML files, giving only the main folder's path.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @param filePath the root path to scan
 * @param fs the FileSystem handle
 * @return list of absolute file paths present in the given path
 * @throws FileNotFoundException
 * @throws IOException
 */
public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
    List<String> fileList = new ArrayList<String>();
    FileStatus[] fileStatus = fs.listStatus(filePath);
    for (FileStatus fileStat : fileStatus) {
        if (fileStat.isDirectory()) {
            // recurse into subdirectories
            fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
        } else {
            fileList.add(fileStat.getPath().toString());
        }
    }
    return fileList;
}
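For reference, here is a minimal way to call this helper (a sketch; the path is a placeholder for your main folder):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// prints every file found under the main folder, at any depth
for (String file : getAllFilePath(new Path("/main/folder"), fs)) {
    System.out.println(file);
}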
Quick example: suppose you have the following file structure:

a -> b
  -> c -> d
       -> e
  -> d -> f
Using the code above, you get:
a/b
a/c/d
a/c/e
a/d/f
If you want only the leaf names (i.e. the file names), use the following code in the else block:

...
} else {
    String fileName = fileStat.getPath().toString();
    fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
}
This will give:
b
d
e
f
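Since the question only needs the XML files, the same else block can also filter on the extension; a small variation (a sketch, using Path.getName(), which returns just the last path component):

} else if (fileStat.getPath().getName().endsWith(".xml")) {
    // getName() already strips the parent directories, so no substring is needed
    fileList.add(fileStat.getPath().toString());
}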
Here is a code snippet that counts the number of files in a particular HDFS directory (I used this to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.
private int calculateNumberOfReducers(String input) throws IOException {
    int numberOfReducers = 0;
    Path inputPath = new Path(input);
    // getConf() is inherited from Configured when this lives in a Tool implementation
    FileSystem fs = inputPath.getFileSystem(getConf());
    FileStatus[] statuses = fs.globStatus(inputPath);
    for (FileStatus status : statuses) {
        if (status.isDirectory()) {
            numberOfReducers += getNumberOfInputFiles(status, fs);
        } else if (status.isFile()) {
            numberOfReducers++;
        }
    }
    return numberOfReducers;
}
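For context, this is roughly how the count plugs into a driver (hypothetical usage; job here is an org.apache.hadoop.mapreduce.Job, and getConf() comes from extending Configured):

int reducers = calculateNumberOfReducers("/user/etl/input/*");
// guard against an empty input directory
job.setNumReduceTasks(Math.max(1, reducers));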
/**
* Recursively determines number of input files in an HDFS directory
*
* @param status instance of FileStatus
* @param fs instance of FileSystem
* @return number of input files within particular HDFS directory
* @throws IOException
*/
private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException {
    int inputFileCount = 0;
    if (status.isDirectory()) {
        FileStatus[] files = fs.listStatus(status.getPath());
        for (FileStatus file : files) {
            inputFileCount += getNumberOfInputFiles(file, fs);
        }
    } else {
        inputFileCount++;
    }
    return inputFileCount;
}
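Note that since Hadoop 2.x, FileSystem also ships a built-in recursive listing, FileSystem#listFiles(Path, boolean), which avoids the hand-rolled recursion entirely; a sketch:

import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.RemoteIterator;

// the boolean flag asks for a recursive walk of the whole tree
RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/main/folder"), true);
while (it.hasNext()) {
    LocatedFileStatus status = it.next();
    if (status.getPath().getName().endsWith(".xml")) {
        System.out.println(status.getPath());
    }
}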
Now, one can use Spark to do the same, and it's way faster than other approaches (such as Hadoop MR). Here is the code snippet.
import scala.collection.mutable.ListBuffer
import org.apache.hadoop.fs.{FileSystem, Path}

def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]): Unit = {
  // sparkContext is the active SparkContext; its Hadoop configuration locates the HDFS cluster
  val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
  files.foreach { fileStatus =>
    if (!fileStatus.isDirectory() && fileStatus.getPath().getName().endsWith(".xml")) {
      filePaths += fileStatus.getPath().toString()
    } else if (fileStatus.isDirectory() && recursiveTraverse) {
      traverseDirectory(fileStatus.getPath().toString(), recursiveTraverse, filePaths)
    }
  }
}
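If the layout is really fixed at two subfolder levels, as in the question, you can skip the recursion entirely with a glob pattern; a minimal sketch using Spark's Java API (the path is a placeholder for your main folder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("list-xml-files");
JavaSparkContext jsc = new JavaSparkContext(conf);
// the glob matches the two known subfolder levels from the question
JavaPairRDD<String, String> xmlFiles = jsc.wholeTextFiles("hdfs:///main/folder/*/*/*.xml");
xmlFiles.keys().collect().forEach(System.out::println);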