How can I find all of the distinct file extensions in a folder hierarchy?

梦谈多话 2020-11-30 16:00

On a Linux machine I would like to traverse a folder hierarchy and get a list of all of the distinct file extensions within it.

What would be the best way to achieve this?

16 answers
  • 2020-11-30 16:54

    PowerShell:

    dir -recurse | select-object extension -unique
    

    Thanks to http://kevin-berridge.blogspot.com/2007/11/windows-powershell.html

  • 2020-11-30 16:54

    I tried a bunch of the answers here, even the "best" answer, and they all fell short of what I was specifically after. After about 12 hours of wrestling with regex across multiple programs, and of reading and testing these answers, this is what I came up with. It works EXACTLY the way I want.

     find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort -u
    
    • find selects all files whose names contain a dot, i.e. those that may have an extension.
    • The first grep keeps only the extension (the last dot and everything after it).
    • The second grep keeps alphabetic extensions between 2 and 16 characters long (adjust the numbers if they don't fit your needs). This helps avoid cache files and system files (system file bit is to search jail).
    • awk prints the extensions in lower case.
    • sort -u sorts and keeps only unique values. I originally tried the awk answer, but it printed extensions twice when they differed only in case.
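
    The steps above can also be sketched in Python. This is only a rough equivalent, not the pipeline itself: the names extensions_of and count_extensions are made up for illustration, [A-Za-z] stands in for [[:alpha:]] (ASCII letters only), and os.walk reports every non-directory entry, which is slightly broader than find -type f.

```python
import os
import re
from collections import Counter

# Mirrors grep -o -E "\.[^.]+$": the final dot and everything after it.
LAST_SUFFIX = re.compile(r"\.[^.]+$")
# ASCII stand-in (an assumption) for grep -o -E "[[:alpha:]]{2,16}".
ALPHA_RUN = re.compile(r"[A-Za-z]{2,16}")


def extensions_of(filename):
    """Return the lowercase alphabetic runs found in the last suffix of filename."""
    suffix = LAST_SUFFIX.search(filename)
    if suffix is None:
        # No dot, or nothing after the final dot: no extension to report.
        return []
    return [run.lower() for run in ALPHA_RUN.findall(suffix.group())]


def count_extensions(root):
    """Walk root and count extensions over every file name os.walk reports."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts.update(extensions_of(name))
    return counts
```

    sorted(count_extensions(".")) gives the distinct list, and count_extensions(".").most_common() gives the counts in descending order. As with the grep version, digits are not alphabetic, so a name like song.mp3 is reported as "mp".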

    If you need a count per file extension, use the code below instead:

    find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort | uniq -c | sort -rn
    

    While these methods will take some time to complete and probably aren't the best ways to go about the problem, they work.

    Update: Per @alpha_989, long file extensions cause an issue. That was due to the original regex "[[:alpha:]]{3,6}". I have updated the answer to use the regex "[[:alpha:]]{2,16}". However, anyone using this code should be aware that those numbers are the minimum and maximum extension length allowed in the final output. An extension longer than the maximum is split across multiple lines in the output, and an alphabetic run shorter than the minimum is dropped.
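
    To see what the length bounds do, here is a small Python sketch. It assumes that re.findall behaves like grep -o here (non-overlapping matches, left to right) and uses [A-Za-z] as an ASCII stand-in for [[:alpha:]]; the helper name alpha_runs is made up for illustration.

```python
import re

# ASCII stand-in (an assumption) for grep -o -E "[[:alpha:]]{2,16}".
BOUNDED_RUN = re.compile(r"[A-Za-z]{2,16}")


def alpha_runs(extension):
    """Return each non-overlapping alphabetic run of 2 to 16 letters."""
    return BOUNDED_RUN.findall(extension)


print(alpha_runs(".markdown"))               # ['markdown']
print(alpha_runs(".verylongextensionname"))  # ['verylongextensio', 'nname']
print(alpha_runs(".h"))                      # []
print(alpha_runs(".mp3"))                    # ['mp']
```

    The 21-letter extension comes out as two entries, the single-letter one disappears, and the digit in .mp3 cuts the match short.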

    Note: Original post did read "- Greps for file extensions between 3 and 6 characters (just adjust the numbers if they don't fit your need). This helps avoid cache files and system files (system file bit is to search jail)."

    Idea: Could be used to find file extensions over a specific length via:

     find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{4,}" | awk '{print tolower($0)}' | sort -u
    

    Here 4 is the minimum extension length: extensions of that length or longer are included.

  • 2020-11-30 16:55

    In Python, using generators to cope with very large directories. This includes files with no extension and counts the number of times each extension shows up:

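    import collections
    import itertools
    import json
    import os
    
    root = '/home/andres'
    # Lazily chain the per-directory file lists instead of building one big list.
    files = itertools.chain.from_iterable(
        files for _, _, files in os.walk(root)
    )
    # splitext()[1] is the extension including the dot, or '' when there is none.
    counter = collections.Counter(
        os.path.splitext(file_)[1] for file_ in files
    )
    # Python 3: print is a function, not a statement.
    print(json.dumps(counter, indent=2))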
    
  • 2020-11-30 16:55

    I've found it simple and fast...

       # find . -type f -exec basename {} \; | awk -F"." '{print $NF}' > /tmp/outfile.txt
       # sort /tmp/outfile.txt | uniq -c | sort -n > /tmp/outfile_sorted.txt
    