Question
Indo Cheap has a sample file like this:
XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant
He needs to find the most frequently occurring sequences of 3 letters within a word.
The output should be:
acc = 5 aco = 3
He asks whether that is possible in bash.
He says: "I got absolutely no idea how I can accomplish it with either awk, sed, grep.
Any clue how it's possible..."
Answer 1:
This is absolutely possible with bash, sed and awk, and here is how to do it:
#!/bin/bash
# lowercase the input and split it into one word per iteration
for word in $(tr 'A-Z' 'a-z' < sample | tr -s ' ' '\n'); do
    len=${#word}
    for ((i = 0; i < len - 2; i++)); do   # slide a 3-letter window over the word
        echo "${word:i:3}"                # print each sequence of 3 letters
    done
done |
    sort |                      # sort the sequences of three letters
    uniq -c |                   # count the sequences
    sed '/^ *1 /d' |            # filter out sequences that occur only once
    sort -n -r |                # most frequent sequences first
    awk '{print $2" = "$1}' |   # format output as asked
    tr '\n' ' '                 # put all results on one line
echo                            # add a new line at the end
The output for the sample above is:
cou = 5 acc = 5 unt = 4 tin = 4 oun = 4 nti = 4 ing = 4 cco = 4 aco = 3
If a different output format is wanted, the script can easily be adapted to suit.
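As one possible adaptation, here is a minimal sketch that wraps the same idea in a function, prints one result per line, and lets the caller pick the minimum count. The function name count_trigrams and its two arguments (file name, minimum count defaulting to 2) are hypothetical and not part of the original answer. The counting is done in awk here instead of with sort and uniq -c, and ties are broken alphabetically, so the order of equal counts differs from the output above.

```shell
# Hypothetical helper: count_trigrams FILE [MIN_COUNT]
# Prints "seq = count", one per line, most frequent first.
count_trigrams() {
    local file=$1 min=${2:-2}
    # lowercase the input and put one word on each line
    tr 'A-Z' 'a-z' < "$file" | tr -s ' ' '\n' |
        awk -v min="$min" '
            {
                # slide a 3-character window over the word
                for (i = 1; i + 2 <= length($0); i++)
                    count[substr($0, i, 3)]++
            }
            END {
                # keep only sequences seen at least min times
                for (seq in count)
                    if (count[seq] >= min)
                        print count[seq], seq
            }' |
        sort -k1,1nr -k2,2 |        # count descending, then sequence ascending
        awk '{print $2 " = " $1}'   # format as "seq = count"
}
```

With the sample file above, count_trigrams sample 5 should print only the two lines "acc = 5" and "cou = 5".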
Source: https://stackoverflow.com/questions/59724585/how-can-i-count-most-occuring-sequences-of-3-letters-within-a-word-with-a-bash-s