I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to find how many times each word appears in the file.
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
#!/usr/bin/env bash
declare -A map
words="$1"
[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }
while read -r line; do
    for word in $line; do
        ((map[$word]++))
    done
done < "$words"
for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5
awk '{
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        print word[i], i
    }
}' file.txt | sort -nr
uniq -c already does what you want; just sort the input first:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
output:
6 a
7 d
7 s
Suppose I have the following text in my file.txt:
This is line number one
This is Line Number Tow
this is Line Number tow
I can find the frequency of each word using the following command:
cat file.txt | tr ' ' '\n' | sort | uniq -c
output:
3 is
1 line
2 Line
1 number
2 Number
1 one
1 this
2 This
1 tow
1 Tow
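Note that uniq -c counts case-sensitively, which is why Line and line (and Tow and tow) show up separately above. If you want them merged, fold case with one more tr stage before counting (a small variation on the same pipeline):

```shell
# Lowercase everything first so "Line"/"line" and "Tow"/"tow" are counted together
tr '[:upper:]' '[:lower:]' < file.txt | tr ' ' '\n' | sort | uniq -c
```

For the same file.txt this gives:
3 is
3 line
3 number
1 one
3 this
2 tow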
Let's do it in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
lines = sys.stdin
else:
lines = open(sys.argv[1])
D = {}
for line in lines:
for word in line.split():
word = ''.join(list(filter(
lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
word)))
word = word.lower()
if word in D:
D[word] += 1
else:
D[word] = 1
for word in sorted(D, key=D.get, reverse=True):
print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now to find the word frequencies in your file "content.txt", you do:
freq content.txt
You can also pipe output to it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you are using Python 2, just replace
''.join(list(filter(args...))) with filter(args...),
python3 with python,
and print(whatever) with print whatever.
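For what it's worth, in Python 3 the core of the script above can also be written with collections.Counter from the standard library, which handles both the counting and the highest-first ordering. This is a sketch, not a drop-in replacement: it strips punctuation only from the ends of each word, a slight difference from the script above, which removes punctuation characters anywhere in the word.

```python
import string
from collections import Counter

def word_frequencies(lines):
    """Case-insensitive word counts; punctuation is stripped from
    the ends of each word only."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = word.strip(string.punctuation).lower()
            if cleaned:
                counts[cleaned] += 1
    return counts

# Demo on two lines of the sample text from the question;
# most_common() yields highest-frequency words first.
sample = ["This is a file with many words.",
          "Some of the words appear more than once."]
for word, n in word_frequencies(sample).most_common():
    print(word, n)
```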