On a Linux machine I would like to traverse a folder hierarchy and get a list of all of the distinct file extensions within it.
What would be the best way to achieve
No need for the pipe to sort
, awk can do it all:
find . -type f | awk -F. '!a[$NF]++{print $NF}'
My awk-less, sed-less, Perl-less, Python-less POSIX-compliant alternative:
find . -type f | rev | cut -d. -f1 | rev | tr '[:upper:]' '[:lower:]' | sort | uniq --count | sort -rn
The trick is that it reverses the line and cuts the extension at the beginning.
It also converts the extensions to lower case.
Example output:
3689 jpg
1036 png
610 mp4
90 webm
90 mkv
57 mov
12 avi
10 txt
3 zip
2 ogv
1 xcf
1 trashinfo
1 sh
1 m4v
1 jpeg
1 ini
1 gqv
1 gcs
1 dv
Find everythin with a dot and show only the suffix.
find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u
if you know all suffix have 3 characters then
find . -type f -name "*.???" | awk -F. '{print $NF}' | sort -u
or with sed shows all suffixes with one to four characters. Change {1,4} to the range of characters you are expecting in the suffix.
find . -type f | sed -n 's/.*\.\(.\{1,4\}\)$/\1/p'| sort -u
Since there's already another solution which uses Perl:
If you have Python installed you could also do (from the shell):
python -c "import os;e=set();[[e.add(os.path.splitext(f)[-1]) for f in fn]for _,_,fn in os.walk('/home')];print '\n'.join(e)"
None of the replies so far deal with filenames with newlines properly (except for ChristopheD's, which just came in as I was typing this). The following is not a shell one-liner, but works, and is reasonably fast.
import os, sys
def names(roots):
for root in roots:
for a, b, basenames in os.walk(root):
for basename in basenames:
yield basename
sufs = set(os.path.splitext(x)[1] for x in names(sys.argv[1:]))
for suf in sufs:
if suf:
print suf
I don't think this one was mentioned yet:
find . -type f -exec sh -c 'echo "${0##*.}"' {} \; | sort | uniq -c