Question
I am trying to calculate similarity of rows of one document term matrix with rows of another document term matrix.
library(quanteda)

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)
# One corpus per data frame, with document names that identify the source
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")
# Stack both bigram dfms, compute every pairwise cosine, then keep only the A-by-B block
dtm3 <- rbind(dfm(corp1, ngrams = 2), dfm(corp2, ngrams = 2))
d1 <- textstat_simil(dtm3, method = "cosine")
d1 <- as.matrix(d1)
# The dot is escaped so that it matches a literal "." rather than any character
d1 <- d1[grepl("^A\\.", row.names(d1)), grepl("^B\\.", colnames(d1))]
In the code I am calculating similarity on the combined matrix and then removing the irrelevant cells from the result. Is it possible to compare one document from A at a time in textstat_simil(dtm3, method = "cosine")? Below is the table I am looking for. Also, the size of the matrix doubled when I used as.matrix(d1).
          B.1       B.2       B.3       B.4
A.1 0.3333333 0.0000000 0.4082483 1.0000000
A.2 0.4082483 0.0000000 0.0000000 0.0000000
A.3 0.4082483 0.7071068 0.0000000 0.4082483
A.4 0.0000000 0.5000000 0.0000000 0.0000000
Answer 1:
The following will work, although, as you point out, coercing the dist class object returned by textstat_simil() into a matrix doubles the size of the cosine similarity object, because a dist object stores only one triangle of the symmetric matrix.
d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000
Note that your use of ngrams = 2 in the creation of dtm3 will create a dfm from bigram features only (which are quite infrequent). If you want unigrams as well as bigrams, then this should be ngrams = 1:2 instead.
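To see why bigram-only features overlap so rarely, here is a minimal base R sketch that rebuilds the bigrams of A.1 and B.1 by hand. The bigrams() helper is hypothetical and not part of quanteda; it splits on whitespace only, which merely approximates quanteda's tokenizer, and the shortcut for the cosine assumes that each bigram occurs at most once per document (true for these two texts).

```r
# Hypothetical helper (not a quanteda function): lowercase the text, split it
# on whitespace, and join each pair of adjacent tokens with "_".
bigrams <- function(x) {
  tok <- strsplit(tolower(x), "\\s+")[[1]]
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1), sep = "_")
}

a1 <- bigrams("X-ray right leg arteries")  # "x-ray_right" "right_leg" "leg_arteries"
b1 <- bigrams("X-ray left leg arteries")   # "x-ray_left" "left_leg" "leg_arteries"

# Assumption: every bigram count is 0 or 1, so the dot product is the number
# of shared bigrams and each vector norm is sqrt(number of bigrams).
shared <- intersect(a1, b1)                # "leg_arteries"
cos_a1_b1 <- length(shared) / sqrt(length(a1) * length(b1))
cos_a1_b1                                  # 0.3333333, the A.1/B.1 cell above
```

Only one of the three bigrams in each document is shared, which is why the two nearly identical strings score just 1/3.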
That should work pretty well for most problems. If you are worried about the size of your object, you can either loop across individual selections of dtm3, building up the target object, or lapply() the comparisons as follows (but this is much less efficient):
cosines <- lapply(docnames(corp2),
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
#           B.1       B.2       B.3       B.4
# A.1 0.3333333 0.0000000 0.4082483 1.0000000
# A.2 0.4082483 0.0000000 0.0000000 0.0000000
# A.3 0.4082483 0.7071068 0.0000000 0.4082483
# A.4 0.0000000 0.5000000 0.0000000 0.0000000
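Another way to avoid the full square matrix, sketched here in base R rather than with textstat_simil(), is to compute only the A-by-B block directly: the cosine between a row of one count matrix and a row of another is their dot product divided by the product of their norms. The cross_cosine() helper and the small hand-built count matrices are illustrative assumptions, not quanteda objects; the counts correspond to the bigrams of A.1, A.3, B.1 and B.2.

```r
# Hypothetical helper: cosine similarity between every row of x and every row
# of y. Assumes both matrices have the same feature columns in the same order.
cross_cosine <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  stopifnot(identical(colnames(x), colnames(y)))
  dots <- tcrossprod(x, y)  # nrow(x) by nrow(y) matrix of dot products
  dots / outer(sqrt(rowSums(x^2)), sqrt(rowSums(y^2)))
}

# Hand-built bigram counts (illustrative data, not produced by dfm())
feats <- c("x-ray_right", "right_leg", "leg_arteries",
           "x-ray_left", "left_leg", "x-ray_leg")
x <- rbind(A.1 = c(1, 1, 1, 0, 0, 0),
           A.3 = c(0, 0, 1, 0, 0, 1))
y <- rbind(B.1 = c(0, 0, 1, 1, 1, 0),
           B.2 = c(0, 0, 0, 0, 0, 1))
colnames(x) <- feats
colnames(y) <- feats

sims <- cross_cosine(x, y)
sims
#           B.1       B.2
# A.1 0.3333333 0.0000000
# A.3 0.4082483 0.7071068
```

With the real data, the inputs would presumably be the two row subsets of dtm3, for example dtm3[docnames(corp1), ] and dtm3[docnames(corp2), ]. Note that as.matrix() makes them dense, so for a large vocabulary a sparse matrix product would be preferable, and a document with no features at all yields NaN because its norm is zero.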
Source: https://stackoverflow.com/questions/48845052/pairwise-distance-between-documents