How to create a frequency list of every word in a file?

Asked by 心在旅途, 2020-12-04 09:59

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to produce a count of how often each word appears, ideally with sed and grep.

11 Answers
  • 2020-12-04 10:39

    Not sed and grep, but tr, sort, uniq, and awk:

    % (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
    This is a file with many words.
    Some of the words appear more than once.
    Some of the words only appear one time.
    EOF
    
    a@1
    appear@2
    file@1
    is@1
    many@1
    more@1
    of@2
    once.@1
    one@1
    only@1
    Some@2
    than@1
    the@2
    This@1
    time.@1
    with@1
    words@2
    words.@1
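To rank words by how often they occur instead of alphabetically, the word@count pairs can be re-sorted numerically on the count field; a small sketch building on the pipeline above:

```shell
# Re-sort the word@count pairs numerically on the count field
# (highest first), using '@' as the field separator.
tr ' ' '\n' <<'EOF' | sort | uniq -c | awk '{print $2"@"$1}' | sort -t@ -k2,2nr
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
```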
    
  • 2020-12-04 10:41
    #!/usr/bin/env bash

    # Count word frequencies with a bash associative array.
    declare -A map
    words="$1"

    [[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }

    while read -r line; do
      for word in $line; do
        ((map[$word]++))
      done
    done < "$words"

    for key in "${!map[@]}"; do
      echo "the word $key appears ${map[$key]} times"
    done | sort -nr -k5,5
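The heart of the script is the associative-array increment `((map[$word]++))`; a minimal standalone sketch of that idiom, using a made-up sample word list instead of a file:

```shell
#!/usr/bin/env bash
# Count occurrences with a bash associative array, then rank by count.
declare -A map
for word in a b a c a b; do   # sample input
  ((map[$word]++))
done
for key in "${!map[@]}"; do
  echo "$key ${map[$key]}"
done | sort -k2,2nr
```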
    
  • 2020-12-04 10:44
    awk '{
        for (el = 1; el <= NF; ++el) { word[$el]++ }
    }
    END {
        for (i in word) { print word[i], i }
    }' file.txt | sort -nr
    
  • 2020-12-04 10:45

    uniq -c already does what you want; just sort the input first:

    echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
    

    output:

      6 a
      7 d
      7 s
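To list the most frequent words first, a numeric reverse sort can be appended to the same pipeline:

```shell
# Same counting pipeline, plus a final `sort -nr` to rank by frequency.
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c | sort -nr
```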
    
  • 2020-12-04 10:47

    Suppose I have the following text in file.txt:

    This is line number one
    This is Line Number Tow
    this is Line Number tow
    

    I can find the frequency of each word with the following command:

     cat file.txt | tr ' ' '\n' | sort | uniq -c
    

    output:

      3 is
      1 line
      2 Line
      1 number
      2 Number
      1 one
      1 this
      2 This
      1 tow
      1 Tow
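If the case distinction above is unwanted, lowercasing the text before counting merges entries like "Line" and "line"; a sketch using the same sample text:

```shell
# Lowercase first so that "Line" and "line" are counted together.
tr '[:upper:]' '[:lower:]' <<'EOF' | tr ' ' '\n' | sort | uniq -c
This is line number one
This is Line Number Tow
this is Line Number tow
EOF
```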
    
  • 2020-12-04 10:55

    Let's do it in Python 3!

    """Counts the frequency of each word in the given text; words are defined as
    entities separated by whitespaces; punctuations and other symbols are ignored;
    case-insensitive; input can be passed through stdin or through a file specified
    as an argument; prints highest frequency words first"""
    
    # Case-insensitive
    # Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
    
    import sys
    
    # Find if input is being given through stdin or from a file
    lines = None
    if len(sys.argv) == 1:
        lines = sys.stdin
    else:
        lines = open(sys.argv[1])
    
    D = {}
    for line in lines:
        for word in line.split():
            word = ''.join(list(filter(
                lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
                word)))
            word = word.lower()
            if word in D:
                D[word] += 1
            else:
                D[word] = 1
    
    for word in sorted(D, key=D.get, reverse=True):
        print(word + ' ' + str(D[word]))
    

    Let's name this script "frequency.py" and add a line to "~/.bash_aliases":

    alias freq="python3 /path/to/frequency.py"
    

    Now to find the word frequencies in your file "content.txt", you do:

    freq content.txt
    

    You can also pipe output to it:

    cat content.txt | freq
    

    And even analyze text from multiple files:

    cat content.txt story.txt article.txt | freq
    

    If you are using Python 2, just replace

    • ''.join(list(filter(args...))) with filter(args...)
    • python3 with python
    • print(whatever) with print whatever
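For comparison, a rough shell analogue of what the Python script does (strip punctuation, lowercase, split into words, count, rank); the sample sentence here is made up:

```shell
# Strip punctuation, lowercase, one word per line, count, rank.
echo 'One fish, two fish. Red fish!' \
  | tr -d '[:punct:]' \
  | tr '[:upper:]' '[:lower:]' \
  | tr ' ' '\n' \
  | sort | uniq -c | sort -nr
```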