How to list all files in a directory and its subdirectories in Hadoop HDFS

故里飘歌 2020-12-01 05:50

I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders, and each of those, finally, contains XML files. I want to list all the XML files, giving only the main folder's path.

9 Answers
  • 2020-12-01 06:23
    // Required imports
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * @param filePath the HDFS path to start from
     * @param fs the FileSystem instance to use
     * @return list of absolute file paths present under the given path
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static List<String> getAllFilePath(Path filePath, FileSystem fs) throws FileNotFoundException, IOException {
        List<String> fileList = new ArrayList<String>();
        FileStatus[] fileStatus = fs.listStatus(filePath);
        for (FileStatus fileStat : fileStatus) {
            if (fileStat.isDirectory()) {
                // Recurse into subdirectories
                fileList.addAll(getAllFilePath(fileStat.getPath(), fs));
            } else {
                fileList.add(fileStat.getPath().toString());
            }
        }
        return fileList;
    }
    

    Quick example: suppose you have the following file structure:

    a  ->  b
       ->  c  -> d
              -> e 
       ->  d  -> f
    

    Using the code above, you get:

    a/b
    a/c/d
    a/c/e
    a/d/f
    

    If you want only the leaf names (i.e. the file names), use the following code in the else block:

     ...
        } else {
            String fileName = fileStat.getPath().toString(); 
            fileList.add(fileName.substring(fileName.lastIndexOf("/") + 1));
        }
    

    This will give:

    b
    d
    e
    f
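
    To connect this back to the original question (listing only the XML files), a minimal driver sketch might look like the following. The root path /user/data/main is hypothetical, and getAllFilePath is assumed to be defined in the same class:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public static void main(String[] args) throws IOException {
        Path root = new Path("/user/data/main");          // hypothetical main folder path
        FileSystem fs = FileSystem.get(new Configuration());

        // Walk the tree with getAllFilePath and keep only the .xml files
        List<String> allFiles = getAllFilePath(root, fs);
        for (String file : allFiles) {
            if (file.endsWith(".xml")) {
                System.out.println(file);
            }
        }
    }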
    
  • 2020-12-01 06:25

    Here is a code snippet that counts the number of files in a particular HDFS directory (I used it to determine how many reducers to use in a particular ETL job). You can easily modify it to suit your needs.

    private int calculateNumberOfReducers(String input) throws IOException {
        int numberOfReducers = 0;
        Path inputPath = new Path(input);
        FileSystem fs = inputPath.getFileSystem(getConf());
        FileStatus[] statuses = fs.globStatus(inputPath);
        for(FileStatus status: statuses) {
            if(status.isDirectory()) {
                numberOfReducers += getNumberOfInputFiles(status, fs);
            } else if(status.isFile()) {
                numberOfReducers ++;
            }
        }
        return numberOfReducers;
    }
    
    /**
     * Recursively determines number of input files in an HDFS directory
     *
     * @param status instance of FileStatus
     * @param fs instance of FileSystem
     * @return number of input files within particular HDFS directory
     * @throws IOException
     */
    private int getNumberOfInputFiles(FileStatus status, FileSystem fs) throws IOException  {
        int inputFileCount = 0;
        if(status.isDirectory()) {
            FileStatus[] files = fs.listStatus(status.getPath());
            for(FileStatus file: files) {
                inputFileCount += getNumberOfInputFiles(file, fs);
            }
        } else {
            inputFileCount ++;
        }
    
        return inputFileCount;
    }
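
    For context, here is a hedged sketch of how the count might be wired into job setup. The EtlDriver class name and "etl-job" job name are illustrative, and calculateNumberOfReducers is the method shown above:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.util.Tool;

    public class EtlDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "etl-job");   // illustrative job name
            job.setJarByClass(EtlDriver.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... mapper/reducer/output configuration omitted for brevity ...

            // Floor at one reducer in case the input directory is empty
            int reducers = Math.max(1, calculateNumberOfReducers(args[0]));
            job.setNumReduceTasks(reducers);

            return job.waitForCompletion(true) ? 0 : 1;
        }
    }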
    
  • 2020-12-01 06:27

    Now one can use Spark to do the same, and it is typically much faster than other approaches (such as Hadoop MR). Here is the code snippet.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ListBuffer

    def traverseDirectory(filePath: String, recursiveTraverse: Boolean, filePaths: ListBuffer[String]) {
        // List the direct children of filePath via the Hadoop FileSystem API
        val files = FileSystem.get(sparkContext.hadoopConfiguration).listStatus(new Path(filePath))
        files.foreach { fileStatus =>
            if (!fileStatus.isDirectory() && fileStatus.getPath().getName().endsWith(".xml")) {
                filePaths += fileStatus.getPath().toString()
            } else if (fileStatus.isDirectory()) {
                traverseDirectory(fileStatus.getPath().toString(), recursiveTraverse, filePaths)
            }
        }
    }
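
    A minimal usage sketch, assuming a SparkContext named sparkContext is already in scope (as in the snippet above); the /data/main root path is hypothetical:

    // Collect all .xml file paths under the (hypothetical) root directory
    val xmlPaths = ListBuffer[String]()
    traverseDirectory("/data/main", recursiveTraverse = true, filePaths = xmlPaths)
    xmlPaths.foreach(println)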
    