Merging word counts with Bash and Unix

Question

I wrote a Bash script that extracts words from a text file with grep and sed, sorts them with sort, counts the repetitions with wc, and then sorts again by frequency. The example output looks like this:

12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy

Now I'd like to merge all words with the same frequency into one line, like this:

12 the
 7 code with add
 5 quite
 3 do well
 1 quick can pick easy

Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script or program in a more sophisticated scripting language?

Answer 1:

With awk:

$ echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

You can do something similar with Bash 4 associative arrays. awk is easier and is specified by POSIX, though, so use that.
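As a rough illustration of the Bash route, here is a minimal sketch using an associative array. It assumes Bash 4 or later; the function name merge_counts is made up for this example, and the input is a shortened copy of the sample above.

```shell
#!/usr/bin/env bash
# Sketch only: needs Bash 4+ for associative arrays (local -A / declare -A).

# merge_counts is a hypothetical helper name, not part of the original answer.
# It reads "count word" lines on stdin and prints one line per count.
merge_counts() {
    local -A words=()      # key: count, value: space-separated words
    local count word
    while read -r count word; do               # read strips leading blanks
        [[ -n $count ]] || continue            # skip empty lines
        if [[ -n ${words[$count]+set} ]]; then
            words[$count]+=" $word"            # count seen before: append
        else
            words[$count]=$word                # first word for this count
        fi
    done
    # Associative arrays are unordered, so sort numerically afterwards.
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -nr
}

printf '%s\n' '12 the' ' 7 code' ' 7 with' ' 7 add' ' 5 quite' ' 3 do' ' 3 well' |
    merge_counts
```

This still needs the same final sort as the awk version, and it is noticeably longer.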

Explanation:

awk splits each line into fields on the separator in FS, in this case the default of horizontal whitespace;
$1 is the first field, the count; it is used as the key of the associative array cnt, so cnt[$1] collects the items that share a count;
cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment: if cnt[$1] has no value yet, assign the second field $2 to it (the expression to the right of the :); if it already has a value, append $2, separated by the value of OFS (the expression to the left of the :);
At the end, print each key of the associative array followed by the words collected for it.

Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.
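To illustrate that last point, here is a minimal sketch that feeds deliberately unsorted input to the same logic, written with an explicit if/else and an `in` test instead of the ternary. The wrapper name merge_by_count and the sample lines are made up for illustration.

```shell
# merge_by_count is a hypothetical wrapper name, used only in this sketch.
merge_by_count() {
    awk '
    {
        if ($1 in cnt) {
            cnt[$1] = cnt[$1] OFS $2    # count seen before: append the word
        } else {
            cnt[$1] = $2                # first word for this count
        }
    }
    END {
        for (e in cnt) print e, cnt[e]  # traversal order is unspecified
    }' "$@" | sort -nr                  # so order by count afterwards
}

# The input is deliberately out of order.
printf '%s\n' '3 do' '7 code' '1 quick' '7 with' '3 well' '12 the' | merge_by_count
```

Words with the same count come out in the order they were read, and the trailing sort -nr restores the descending order of the counts.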

If you want the digits to be right-justified (as you have in your example):

$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
     END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '

If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc" prior to traversing the array:

$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
            END {PROCINFO["sorted_in"]="@ind_num_desc"
               for (e in cnt) printf "%3s %s\n", e, cnt[e]} '

Answer 2:

With a single GNU awk command (no sort pipeline needed):

awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
     { a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file

The output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Bonus: an alternative solution using the GNU datamash tool:

datamash -W -g1 collapse 2 <file

The output (comma-separated collapsed fields):

12  the
7   code,with,add
5   quite
3   do,well
1   quick,can,pick,easy

Answer 3:

awk:

awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file

sed (GNU sed, since \b and \s are GNU extensions):

sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'

Answer 4:

You start with sorted data, so you only need a new line when the first field changes.

echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" |
awk '
   {
      if ($1==last) { 
         printf(" %s",$2) 
      } else { 
         last=$1;
         printf("%s%s",(NR>1?"\n":""),$0)
      }
    }; END {if (NR) print ""}'   # a bare print would repeat the last record in awks that keep $0 in END

Answer 5:

Next time you find yourself trying to manipulate text with a combination of grep and sed and shell and so on, stop and just use awk instead. The end result will be clearer, simpler, more efficient and more portable.

$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.

$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<=NF; i++) {
        if ($i != "") word2cnt[tolower($i)]++   # skip empty fields, keep a last word with no trailing punctuation
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        # printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 was
  4 the
  4 of
  4 it
  2 times
  2 age
  1 worst
  1 wisdom
  1 foolishness
  1 best

$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<=NF; i++) {
        if ($i != "") word2cnt[tolower($i)]++   # skip empty fields, keep a last word with no trailing punctuation
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        # printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 it was of the
  2 age times
  1 best worst wisdom foolishness

Just uncomment whichever printf line you like in the above script to get whichever type of output you want. The above will work in any awk on any UNIX system.

Answer 6:

Using Miller's nest verb:

mlr -p  nest --implode --values --across-records -f 2 --nested-fs ' ' file

Output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Source: https://stackoverflow.com/questions/46027733/merging-word-counts-with-bash-and-unix

Tags

bash

shell

unix

sed

grep