How to compute similarity in quanteda between documents for adjacent years only, within groups?

Question

I have a diachronic corpus with texts for different organizations, each covering the years 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, 1970 with 1971, and so on. Texts for some years are missing.

In other words,

I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:

ncsimil <- textstat_simil(dfm.cc, 
                           y = NULL,
                           selection = NULL,
                           margin = "documents",
                           method = "jaccard",
                           min_simil = NULL)

This compares every text with every other text, resulting in 2.6+ million lines. I really only need to compare certain texts with the text immediately above, like this:

TextA

TextB

TextC

TextD (has NA)

TextE

So, I want the Jaccard statistic for A and B, B and C, and (since some have NA values) D and E.

I am curious about the y = in textstat_simil

The quanteda documentation says

"y is an optional target matrix matching x in the margin on which the similarity or distance will be computed."

It is not clear to me what this means.

Does it mean I can create two different data frames

and

So that I will get a similarity statistic for

A and B

B and C

and so forth?

Or is there a better way to do this?

Edited starting here... I converted to a data.frame:

df <- convert(dfm.cc, to = "data.frame")

I did bind_cols to add docvars and token counts (2,405 columns -- short texts).

I have isolated the initial texts in a series, e.g.,

OrgA 1970, 1st_in_Series_Yes, TokCount 1...etc.

OrgA 1971, 1st_in_Series_No, TokCount 1...etc.

OrgA 1972, 1st_in_Series_No, TokCount 1...etc.

OrgA 1973, NA

OrgA 1974, 1st_in_Series_Yes, TokCount 1...etc.

OrgZ 1975, 1st_in_Series_No, TokCount 1...etc.

So as not to compare

OrgA 1973 NA with OrgA 1972

OrgA 1974 with OrgA 1973

Manually computing Jaccard should work from here, but there's probably a smarter way. Please share solutions. Thanks.

Answer 1:

Interesting question. I don't have a reproducible example to work with, but I think I can create one using the built-in inaugural corpus dataset. Here, I will use the document variable Year as the time variable, and the unique president (full) name as an analog for your organization (since you don't want year-to-year comparisons across different organizations). If you substitute your organization and time variables for the ones below, this should work.

Note that I make the outer "loop" an lapply, and the inner is an actual loop, but there are clever ways to make the inner part also an lapply. Here I've left it as a for loop for simplicity.

First, get a unique name, since some (different) presidents share the same last names.

library("quanteda")
## Package version: 2.0.1

data_corpus_inaugural$president <- paste(data_corpus_inaugural$President,
  data_corpus_inaugural$FirstName,
  sep = ", "
)
head(data_corpus_inaugural$president, 10)
##  [1] "Washington, George" "Washington, George" "Adams, John"       
##  [4] "Jefferson, Thomas"  "Jefferson, Thomas"  "Madison, James"    
##  [7] "Madison, James"     "Monroe, James"      "Monroe, James"     
## [10] "Adams, John Quincy"

If we make that set unique, then we can iterate across the unique presidents to subset them one at a time. (This is what you will do with each of your organizations.) We can do this using corpus_subset() before creating the dfm, and within that, select just adjacent year pairs. Because the years are sorted, elements i and i+1 are consecutive addresses by the same president. Most of the presidents have only a pair of years, but Franklin Roosevelt, who had four inaugural addresses, has three pairs. Single-term presidents, such as Carter 1977, do not have any pairs.

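# One list element per president (substitute your organization variable here)
simpairs <- lapply(unique(data_corpus_inaugural$president), function(x) {
  # dfm for this president's addresses only
  dfmat <- corpus_subset(data_corpus_inaugural, president == x) %>%
    dfm(remove_punct = TRUE)
  df <- data.frame()
  years <- sort(dfmat$Year)
  # Pair each year with the next one; the loop is skipped for a single address
  for (i in seq_along(years)[-length(years)]) {
    # Similarity for just this two-document subset gives a single pair
    sim <- textstat_simil(
      dfm_subset(dfmat, Year %in% c(years[i], years[i + 1])),
      method = "jaccard"
    )
    df <- rbind(df, as.data.frame(sim))
  }
  df
})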

Now when we join them, you can see that we have computed only what we need.

do.call(rbind, simpairs)
##          document1       document2    jaccard
## 1  1789-Washington 1793-Washington 0.09250399
## 2   1801-Jefferson  1805-Jefferson 0.20512821
## 3     1809-Madison    1813-Madison 0.20138889
## 4      1817-Monroe     1821-Monroe 0.29436202
## 5     1829-Jackson    1833-Jackson 0.20693928
## 6     1861-Lincoln    1865-Lincoln 0.14055885
## 7       1869-Grant      1873-Grant 0.20981595
## 8   1885-Cleveland  1893-Cleveland 0.23037543
## 9    1897-McKinley   1901-McKinley 0.25031211
## 10     1913-Wilson     1917-Wilson 0.21285564
## 11  1933-Roosevelt  1937-Roosevelt 0.20956522
## 12  1937-Roosevelt  1941-Roosevelt 0.20081549
## 13  1941-Roosevelt  1945-Roosevelt 0.18740157
## 14 1953-Eisenhower 1957-Eisenhower 0.21566976
## 15      1969-Nixon      1973-Nixon 0.23451777
## 16     1981-Reagan     1985-Reagan 0.24381368
## 17    1993-Clinton    1997-Clinton 0.24199623
## 18       2001-Bush       2005-Bush 0.24170616
## 19      2009-Obama      2013-Obama 0.24739195
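
Note that the loop above pairs each address with the next available one for the same president, whatever the gap between them. If, as in your data, a missing year should break the series (so that 1972 is never compared with 1974), one option is to keep only pairs whose years differ by exactly one. Below is a minimal base R sketch of that idea, not a quanteda feature: the data frame docs, its columns org, year and tokens, and the helpers jaccard() and adjacent_jaccard() are all hypothetical, and Jaccard is computed by hand on the sets of unique tokens.

```r
# Hypothetical toy data: one row per organization-year, tokens in a list column.
docs <- data.frame(
  org  = c("OrgA", "OrgA", "OrgA", "OrgA", "OrgB", "OrgB"),
  year = c(1970, 1971, 1972, 1974, 1970, 1971),
  stringsAsFactors = FALSE
)
docs$tokens <- list(
  c("a", "b", "c"),
  c("b", "c", "d"),
  c("c", "d", "e"),
  c("a", "b"),
  c("x", "y"),
  c("x", "y", "z")
)

# Jaccard similarity of two token vectors: |intersection| / |union| of unique types.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# For each organization, sort by year and keep only pairs exactly one year apart.
adjacent_jaccard <- function(docs) {
  pieces <- lapply(split(docs, docs$org), function(d) {
    d <- d[order(d$year), ]
    i <- which(diff(d$year) == 1)  # rows whose successor is the next calendar year
    if (length(i) == 0) return(NULL)
    data.frame(
      org     = d$org[i],
      year1   = d$year[i],
      year2   = d$year[i + 1],
      jaccard = mapply(jaccard, d$tokens[i], d$tokens[i + 1]),
      stringsAsFactors = FALSE
    )
  })
  # unname() keeps the row names of the combined result as plain 1..n
  do.call(rbind, unname(pieces))
}

result <- adjacent_jaccard(docs)
print(result)
```

With this toy data, result has three rows: OrgA 1970-1971, OrgA 1971-1972 and OrgB 1970-1971. The 1972-1974 pair is skipped because of the gap. Rows whose text is NA would need to be dropped before calling the function.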

For computing similarity you might want to add more options to the dfm creation line - I only removed punctuation here but you could also remove stopwords, numbers, etc. if that is what you want.
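
As for the y argument you asked about: as far as I understand it, y is a second dfm with the same features as x, and textstat_simil() then compares every document in x with every document in y, rather than pairing them row by row. So splitting the corpus into an "earlier" and a "later" dfm would still compute pairs you do not need. Here is a small sketch with made-up texts. The object names are illustrative, and the requireNamespace() line is there because, to my knowledge, textstat_simil() moved to the separate quanteda.textstats package from quanteda version 3 onwards.

```r
library("quanteda")
# Assumption: from quanteda 3 onwards, textstat_simil() lives in quanteda.textstats.
if (requireNamespace("quanteda.textstats", quietly = TRUE)) {
  library("quanteda.textstats")
}

# Three made-up documents that share one feature set.
toks  <- tokens(c(TextA = "a b c", TextB = "b c d", TextC = "c d e"))
dfmat <- dfm(toks)

# x holds the "earlier" documents (A, B), y the "later" ones (B, C).
sim_xy <- textstat_simil(
  dfmat[1:2, ],
  y = dfmat[2:3, ],
  margin = "documents",
  method = "jaccard"
)

# Every x document is compared with every y document.
as.matrix(sim_xy)
```

The result is a 2 x 2 matrix (TextA and TextB in rows, TextB and TextC in columns), so it includes A-C and B-B as well as the A-B and B-C pairs you are after.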

Source: https://stackoverflow.com/questions/61626262/how-to-compute-similarity-in-quanteda-between-documents-for-adjacent-years-only

Tags

similarity

corpus

quanteda