How to make a comparison between sentences and calculate the similarity?

佐手、 posted on 2021-02-05 07:36:01

Question


How can I compare the first sentence with the second, the first with the third, and so on, and calculate the similarity of each pair using a shell script or bash?

I have sentences containing duplicate words. For example, the input data is in the file my_text.txt, shown below. The comparison should ignore words duplicated within a sentence, filler words, and non-alphabetical characters.

Shell Script
Linux Shell Script
Shell or bash are fun

I used this shell script to find the similarity:

  words=$(
  < my_text.txt tr 'A-Z' 'a-z' |
  grep -Eon '\b[a-z]*\b' |
  grep -Fwvf <(printf %s\\n is a to be by the and for) |
  sort -u | cut -d: -f2 | sort
  )
  union=$(uniq <<< "$words" | wc -l)
  intersection=$(uniq -d <<< "$words" | wc -l)
  echo "similarity is $(bc -l <<< "$intersection/$union")"

The script above calculates a single similarity across all sentences at once, but I want the similarity of every pair (e.g. 1:2, 1:3, 1:4, …, 2:3, 2:4, …, 3:4, ...).

I want to find the similarity as in these two examples:

  • For the first and second sentences:
  • The intersection of both sentences: Shell + Script
  • The size of the union of both sentences: 3
  • The similarity: 0.66666666

--

  • For the first and third sentences:
  • The intersection of both sentences: Shell
  • The size of the union of both sentences: 4
  • The similarity: 0.25
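
For reference, the ratio used in both examples is the Jaccard index: the size of the intersection divided by the size of the union. In the formula below, A and B are assumed names (not from the original post) for the sets of distinct, non-filler words in the two sentences being compared; the worked value is the first example.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
J\bigl(\{\text{shell}, \text{script}\},\ \{\text{linux}, \text{shell}, \text{script}\}\bigr)
  = \frac{2}{3} \approx 0.667
```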

Can somebody help?


Answer 1:


With a small tweak to my answer to your previous question, still using GNU awk for FPAT and arrays of arrays:

$ cat tst.awk
BEGIN {
    split("is a to be by the and for",tmp)
    for (i in tmp) {
        stopwords[tmp[i]]
    }
    FPAT="[[:alnum:]_]+"
}
{
    for (i=1; i<=NF; i++) {
        word = tolower($i)
        if ( !(word in stopwords) ) {
            words[NR>1?2:1][word]
        }
    }
}
NR > 1 {
    numCommon = 0
    for (word in words[1]) {
        if (word in words[2]) {
            numCommon++
        }
    }
    totWords = length(words[1]) + length(words[2]) - numCommon
    print (totWords ? numCommon / totWords : 0)
    delete words[2]
}

$ awk -f tst.awk file
0.666667
0.166667
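
The script above compares the first sentence with each later one. For every pair (1:2, 1:3, 2:3, ...), one possible extension is sketched below. It is not part of the original answer: the function name `pairwise_similarity` is hypothetical, the filler-word list is the one from the question, and the sketch assumes one sentence per line with no blank lines. It uses portable awk with composite array keys instead of GNU awk's FPAT and arrays of arrays.

```shell
#!/usr/bin/env bash

# pairwise_similarity FILE
# Hypothetical helper: prints "i:j similarity" for every pair of lines i < j,
# where similarity = |intersection| / |union| of each line's distinct words.
pairwise_similarity() {
    awk '
    BEGIN {
        # Filler words to ignore (list taken from the question).
        n = split("is a to be by the and for", tmp, " ")
        for (i = 1; i <= n; i++) stop[tmp[i]] = 1
    }
    {
        line = tolower($0)
        gsub(/[^a-z]+/, " ", line)          # drop non-alphabetical characters
        m = split(line, w, " ")
        for (i = 1; i <= m; i++) {
            if (w[i] in stop) continue
            if (!((NR, w[i]) in has)) {     # count each word once per line
                has[NR, w[i]] = 1
                count[NR]++
                list[NR, count[NR]] = w[i]
            }
        }
    }
    END {
        for (a = 1; a < NR; a++) {
            for (b = a + 1; b <= NR; b++) {
                common = 0
                for (k = 1; k <= count[a]; k++)
                    if ((b, list[a, k]) in has) common++
                total = count[a] + count[b] - common
                printf "%d:%d %.6f\n", a, b, (total ? common / total : 0)
            }
        }
    }' "$1"
}
```

Calling `pairwise_similarity my_text.txt` on the three sample sentences should print `1:2 0.666667`, `1:3 0.166667` and `2:3 0.142857`. The 1:3 value differs from the 0.25 in the question because "or" and "are" are not in the filler-word list; adding them to the list in the BEGIN block shrinks that union to 4 words and gives 0.25.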


Source: https://stackoverflow.com/questions/65373832/how-to-make-a-comparison-between-sentences-and-calculate-the-similarity
