How can I count most occurring sequences of 3 letters within a word with a bash script [duplicate]

Submitted by 拜拜、爱过 on 2020-01-17 12:45:30

Question


Indo Cheap has a sample file like this:

XYZAcc
ABCAccounting
Accounting firm
Accounting Aco
Accounting Acompany
Acoustical consultant

He needs to find the most frequently occurring sequences of 3 letters within a word.

Output should be

acc = 5 aco = 3

He asks if that is possible in bash.

He says: "I got absolutely no idea how I can accomplish it with either awk, sed, grep.

Any clue how it's possible..."


Answer 1:


This is absolutely possible with bash, sed and awk, and here is how to do it:

#!/bin/bash

# lowercase the file and split it into one word per line
for line in $(tr 'A-Z' 'a-z' < sample | tr -s ' ' '\n'); do   # for each word
  ll=${#line}
  for ((i = 0; i < ll - 2; i++)); do    # for each starting position in the word
    printf '%s\n' "${line:i:3}"         # print the sequence of 3 letters
  done
done |
  sort |                                # sort the sequences of three letters
  uniq -c |                             # count the sequences
  sed '/^ *1 /d' |                      # filter out the sequences that occur only once
  sort -n -r |                          # most frequent sequences first
  awk '{print $2 " = " $1}' |           # format output as asked
  tr '\n' ' '                           # put all results on one line
echo                                    # add a new line at the end

And the output for the sample above is:

cou = 5 acc = 5 unt = 4 tin = 4 oun = 4 nti = 4 ing = 4 cco = 4 aco = 3

If a different output format is wanted, the script can easily be adapted.
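As one possible adaptation, here is a minimal sketch that prints one sequence per line instead of a single line. It is not part of the original answer: the function name count_trigrams is hypothetical, it reads from standard input rather than from the file sample, and it moves the counting into awk. Ties are broken alphabetically, so the order of sequences with equal counts differs from the script above.

```shell
#!/bin/bash

# count_trigrams (hypothetical helper name): read text on stdin and print
# every 3-character sequence that occurs more than once, one per line,
# most frequent first, ties in alphabetical order.
count_trigrams() {
  tr 'A-Z' 'a-z' |                              # ignore case
    awk '{
      for (w = 1; w <= NF; w++)                 # each whitespace-separated word
        for (i = 1; i <= length($w) - 2; i++)   # each 3-character window
          count[substr($w, i, 3)]++
    }
    END {
      for (seq in count)
        if (count[seq] > 1)                     # drop sequences seen only once
          print count[seq], seq
    }' |
    sort -k1,1nr -k2,2 |                        # by count descending, then by sequence
    awk '{ print $2 " = " $1 }'                 # format as "seq = count"
}
```

It could be called as count_trigrams < sample. Piping the result through head -n 2 would keep only the two top entries, if a shorter list is closer to what is needed.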



Source: https://stackoverflow.com/questions/59724585/how-can-i-count-most-occuring-sequences-of-3-letters-within-a-word-with-a-bash-s
