I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to generate a frequency list of every word in the file, with output in the form word@count (for example, words@3).
This might work for you:
tr '[:upper:]' '[:lower:]' <file |  # fold everything to lower case
tr -d '[:punct:]' |                 # strip punctuation
tr -s ' ' '\n' |                    # break into one word per line
sort |                              # group identical words together
uniq -c |                           # count each group
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'  # turn "    3 words" into "words@3"
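On the three-line sample file above, this should produce something like the following (alphabetical, since sort runs before the columns are rearranged; exact ordering can vary with locale):

a@1
appear@2
file@1
...
words@3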
You can use tr for this; just run

tr ' ' '\12' < NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt

('\12' is the octal escape for a newline, so this puts each word on its own line before counting.)
Sample Output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
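If you want the word@count format from the question, a small awk stage at the end can swap the two columns that uniq -c produces (a sketch, using the same NAME_OF_FILE placeholder):

tr ' ' '\12' < NAME_OF_FILE | sort | uniq -c | sort -nr | awk '{print $2 "@" $1}'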
This function lists the frequency of each word occurring in the provided file, in descending order:
function wordfrequency() {
  awk '
    BEGIN { FS="[^a-zA-Z]+" }
    {
      for (i=1; i<=NF; i++) {
        word = tolower($i)
        if (word != "")   # skip empty fields left by leading/trailing punctuation
          words[word]++
      }
    }
    END {
      for (w in words)
        printf("%3d %s\n", words[w], w)
    }
  ' | sort -rn
}
You can call it on your file like this:
$ cat your_file.txt | wordfrequency
Source: AWK-ward Ruby
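Since the function reads standard input, the cat is not strictly needed; a plain redirect works as well:

$ wordfrequency < your_file.txt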
Content of the input file
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic will count and ignore case, but the result list will have This instead of this.
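Note that \L and the \n in the replacement text are GNU sed extensions. If you are stuck with another sed, a sketch of a portable equivalent does the case folding and word splitting with tr instead:

$ tr '[:upper:]' '[:lower:]' < inputFile.txt | tr -d '.' | tr ' ' '\n' | sort | uniq -c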
The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can easily be adjusted and then piped to sort; see the sketch after the code below.
awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
Broken out onto multiple lines:
awk '{
    gsub(/\./, "");
    for (i = 1; i <= NF; i++) {
        w = tolower($i);
        count[w]++;
        words[w] = w
    }
}
END {
    qty = asort(words);
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
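A minimal sketch of that adjustment for a POSIX awk without asort(): keep only the count array, print it unsorted in END, and let an external sort order the result:

awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++)
        count[tolower($i)]++
}
END {
    for (w in count)
        print w "@" count[w]
}' inputfile | sort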