I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to find how many times each word appears in the file.
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
#!/usr/bin/env bash
declare -A map
words="$1"
[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }
while read -r line; do
    for word in $line; do
        ((map[$word]++))
    done
done < "$words"
for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5
awk '{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        print word[i], i
    }
}' file.txt | sort -nr
uniq -c already does what you want; just sort the input first:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
output:
6 a
7 d
7 s
Suppose I have the following text in my file.txt:
This is line number one
This is Line Number Tow
this is Line Number tow
I can find the frequency of each word using the following command:
cat file.txt | tr ' ' '\n' | sort | uniq -c
output:
3 is
1 line
2 Line
1 number
2 Number
1 one
1 this
2 This
1 tow
1 Tow
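Note that uniq -c counts case-sensitively, which is why Line and line (and Tow and tow) show up separately above. If you want them merged, fold case with one more tr stage before counting (a small variation on the same pipeline):

```shell
# Lowercase everything first so "Line"/"line" and "Tow"/"tow" are counted together
tr '[:upper:]' '[:lower:]' < file.txt | tr ' ' '\n' | sort | uniq -c
```

For the same file.txt this gives:
3 is
3 line
3 number
1 one
3 this
2 tow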
Let's do it in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
lines = sys.stdin
else:
lines = open(sys.argv[1])
D = {}
for line in lines:
for word in line.split():
word = ''.join(list(filter(
lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
word)))
word = word.lower()
if word in D:
D[word] += 1
else:
D[word] = 1
for word in sorted(D, key=D.get, reverse=True):
print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now to find the word frequencies in your file "content.txt", you do:
freq content.txt
You can also pipe output to it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you are using Python 2, just replace
''.join(list(filter(args...))) with filter(args...),
python3 with python,
and print(whatever) with print whatever.
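For what it's worth, in Python 3 the core of the script above can also be written with collections.Counter from the standard library, which handles both the counting and the highest-first ordering. This is a sketch, not a drop-in replacement: it strips punctuation only from the ends of each word, a slight difference from the script above, which removes punctuation characters anywhere in the word.

```python
import string
from collections import Counter

def word_frequencies(lines):
    """Case-insensitive word counts; punctuation is stripped from
    the ends of each word only."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = word.strip(string.punctuation).lower()
            if cleaned:
                counts[cleaned] += 1
    return counts

# Demo on two lines of the sample text from the question;
# most_common() yields highest-frequency words first.
sample = ["This is a file with many words.",
          "Some of the words appear more than once."]
for word, n in word_frequencies(sample).most_common():
    print(word, n)
```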