Question
How to compare the first sentence with the second, the first with the third, and so on, and calculate the similarity using a shell script or bash
I have sentences containing duplicate words. The comparison should ignore duplicated words within a sentence, filler words, and non-alphabetical characters. For example, the input data in file my_text.txt is:
Shell Script
Linux Shell Script
Shell or bash are fun
I used this shell script to find the similarity:
words=$(
    < my_text.txt tr 'A-Z' 'a-z' |
    grep -Eon '\b[a-z]*\b' |
    grep -Fwvf <(printf %s\\n is a to be by the and for) |
    sort -u | cut -d: -f2 | sort
)
union=$(uniq <<< "$words" | wc -l)
intersection=$(uniq -d <<< "$words" | wc -l)
echo "similarity is $(bc -l <<< "$intersection/$union")"
The script above calculates a single similarity across all sentences at once, but I want the similarity of every pair of sentences (e.g. 1:2, 1:3, 1:4, …, 2:3, 2:4, …, 3:4, ...).
I want to find the similarity as in these 2 examples:
- For the first and second sentences:
  - The intersection of both sentences:
    Shell + Script
  - The union size of both sentences:
    3
  - The similarity:
    0.66666666
--
- For the first and third sentences:
  - The intersection of both sentences:
    Shell
  - The union size of both sentences:
    4
  - The similarity:
    0.25
Can somebody help?
Answer 1:
With a small tweak to my answer to your previous question, still using GNU awk for FPAT and arrays of arrays:
$ cat tst.awk
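BEGIN {
    # Words to ignore when comparing sentences
    split("is a to be by the and for",tmp)
    for (i in tmp) {
        stopwords[tmp[i]]
    }
    # GNU awk: a field is any run of alphanumerics or underscores
    FPAT="[[:alnum:]_]+"
}
{
    # words[1] holds the unique words of line 1, words[2] those of the current later line
    for (i=1; i<=NF; i++) {
        word = tolower($i)
        if ( !(word in stopwords) ) {
            words[NR>1?2:1][word]
        }
    }
}
NR > 1 {
    # Similarity = size of the intersection / size of the union
    numCommon = 0
    for (word in words[1]) {
        if (word in words[2]) {
            numCommon++
        }
    }
    totWords = length(words[1]) + length(words[2]) - numCommon
    print (totWords ? numCommon / totWords : 0)
    # Reset the second set before the next line is read
    delete words[2]
}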
$ awk -f tst.awk file
0.666667
0.166667
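The script above compares only the first sentence with each later one. To get every pair (1:2, 1:3, 2:3, ...) as the question asks, one possible approach is to remember the word set of every line and compare the sets at the end. The following is a minimal sketch under some assumptions, not part of the original answer: the function name pairwise_similarity is hypothetical, the input is ASCII text with one sentence per line, and it sticks to portable awk features (no FPAT or arrays of arrays) so that it does not depend on GNU awk.

```shell
#!/bin/sh
# Hypothetical helper: print the similarity of every pair of sentences.
# Usage: pairwise_similarity FILE   (one sentence per line)
pairwise_similarity() {
    awk '
    BEGIN {
        # Same stopword list as in the answer above
        n = split("is a to be by the and for", tmp, " ")
        for (i = 1; i <= n; i++) stop[tmp[i]] = 1
    }
    {
        # Lowercase the line, then turn every run of other characters into a space
        # (assumption: ASCII input)
        line = tolower($0)
        gsub(/[^a-z0-9_]+/, " ", line)
        n = split(line, w, " ")
        for (i = 1; i <= n; i++) {
            if (w[i] in stop) continue
            if ((NR, w[i]) in seen) continue   # ignore duplicates within a sentence
            seen[NR, w[i]] = 1
            size[NR]++                         # number of unique words on this line
        }
    }
    END {
        for (a = 1; a < NR; a++) {
            for (b = a + 1; b <= NR; b++) {
                # Count the words of sentence a that also occur in sentence b
                common = 0
                for (key in seen) {
                    split(key, part, SUBSEP)
                    if (part[1] == a && ((b, part[2]) in seen)) common++
                }
                total = size[a] + size[b] - common   # size of the union
                printf "%d:%d %.6f\n", a, b, (total ? common / total : 0)
            }
        }
    }' "$1"
}
```

Calling pairwise_similarity my_text.txt on the three sample sentences should print 1:2 0.666667, 1:3 0.166667 and 2:3 0.142857, one pair per line. The 1:3 value is 0.166667 rather than the 0.25 expected in the question because "or" and "are" are not in the stopword list; adding them to the list would give 1/4 = 0.25. The inner loop rescans all stored words for each pair, which keeps the sketch short but may be slow for large files.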
Source: https://stackoverflow.com/questions/65373832/how-to-make-a-comparison-between-sentences-and-calculate-the-similarity