How to create a frequency list of every word in a file?

心在旅途 2020-12-04 09:59

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to generate a frequency list of every word that appears in the file.

11 Answers
  • 2020-12-04 10:58

    This might work for you:

    tr '[:upper:]' '[:lower:]' <file |
    tr -d '[:punct:]' |
    tr -s ' ' '\n' | 
    sort |
    uniq -c |
    sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
    
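    Against the question's sample text, this should produce output along these lines:

    a@1
    appear@2
    file@1
    is@1
    many@1
    more@1
    of@2
    once@1
    one@1
    only@1
    some@2
    than@1
    the@2
    this@1
    time@1
    with@1
    words@3
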
  • 2020-12-04 11:01

    You can use tr for this; just run:

    tr ' ' '\12' <NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt
    
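    Here '\12' is the octal escape for a newline, so an equivalent spelling of the same pipeline is:

    tr ' ' '\n' <NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt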

    Sample Output for a text file of city names:

    3026 Toronto
    2006 Montréal
    1117 Edmonton
    1048 Calgary
    905 Ottawa
    724 Winnipeg
    673 Vancouver
    495 Brampton
    489 Mississauga
    482 London
    467 Hamilton
    
  • 2020-12-04 11:01

    Let's use AWK!

    This function lists the frequency of each word occurring in the provided file, in descending order:

    function wordfrequency() {
      awk '
         BEGIN { FS="[^a-zA-Z]+" }       # fields are runs of letters; anything else separates them
         {
             for (i=1; i<=NF; i++) {
                 word = tolower($i)      # normalize case
                 if (word != "")         # skip the empty field a trailing punctuation mark leaves behind
                     words[word]++
             }
         }
         END {
             for (w in words)
                  printf("%3d %s\n", words[w], w)
         } ' | sort -rn                  # highest counts first
    }
    

    You can call it on your file like this:

    $ cat your_file.txt | wordfrequency
    
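    Since the function reads standard input, a plain redirect also works and avoids the extra cat:

    $ wordfrequency < your_file.txt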

    Source: AWK-ward Ruby

  • 2020-12-04 11:05

    Content of the input file

    $ cat inputFile.txt
    This is a file with many words.
    Some of the words appear more than once.
    Some of the words only appear one time.
    

    Using sed | sort | uniq

    $ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
          1 a
          2 appear
          1 file
          1 is
          1 many
          1 more
          2 of
          1 once
          1 one
          1 only
          2 some
          1 than
          2 the
          1 this
          1 time
          1 with
          3 words
    
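    For reference, the three sed expressions can be read as follows (note that the \L lowercasing escape is a GNU sed extension):

    # s/\.//g         delete periods
    # s/\(.*\)/\L\1/  lowercase the whole line (GNU sed)
    # s/ /\n/g        put each word on its own line
    sed -e 's/\.//g' -e 's/\(.*\)/\L\1/' -e 's/ /\n/g' inputFile.txt | sort | uniq -c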

    uniq -ic will count while ignoring case, but the result list will then have This instead of this, because uniq keeps the first line of each group.

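    A sketch of that variant: drop the lowercasing, sort case-insensitively so the mixed-case duplicates end up adjacent, and let uniq -ic merge them:

    sed 's/\.//g;s/ /\n/g' inputFile.txt | sort -f | uniq -ic
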
  • 2020-12-04 11:05

    The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can easily be adjusted to skip the sorting and pipe the output to sort instead (a sketch of that variant follows the code below).

    awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
    

    Broken out onto multiple lines:

    awk '{
        gsub(/\./, "")              # strip periods
        for (i = 1; i <= NF; i++) {
            w = tolower($i)         # normalize case
            count[w]++              # tally the word
            words[w] = w            # keep the word itself so it can be sorted later
        }
    }
    END {
        qty = asort(words)          # gawk-only: sort the stored words alphabetically
        for (w = 1; w <= qty; w++)
            print words[w] "@" count[words[w]]
    }' inputfile
    
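    As mentioned above, if your awk lacks asort(), a minimal portable sketch is to print unsorted from the END block and pipe the result to sort:

    awk '{
        gsub(/\./, "")
        for (i = 1; i <= NF; i++)
            count[tolower($i)]++
    }
    END {
        for (w in count)
            print w "@" count[w]
    }' inputfile | sort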