How to calculate proximity of words to a specific term in a document

Submitted by 落花浮王杯 on 2019-12-01 10:55:54

I'd suggest solving this with a combination of my tidytext and fuzzyjoin packages.

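You can start by tokenizing the text (stored here in song) into a one-row-per-word data frame, adding a position column, and removing stopwords. Because position is assigned before the stopwords are dropped, distances still count the removed words: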

library(tidytext)
library(dplyr)

all_words <- tibble(text = song) %>%   # tibble() replaces the deprecated data_frame()
  unnest_tokens(word, text) %>%
  mutate(position = row_number()) %>%
  filter(!word %in% tm::stopwords("en"))
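To make the effect of numbering before filtering concrete, here is a minimal base R sketch of the same idea. The sample line, the tiny stopword list, and the names sample_text, stop_demo and all_words_demo are assumptions for illustration only; they are not tidytext's tokenizer or tm's stopword list.

```r
# Hypothetical sample line; the original answer works on a vector named `song`.
sample_text <- "The fire was red, it flaming spread"

# Split on anything that is not a letter or apostrophe, then lower-case.
words <- tolower(strsplit(sample_text, "[^A-Za-z']+")[[1]])

# Number every token first, then drop stopwords, so positions keep their gaps.
all_words_demo <- data.frame(word = words,
                             position = seq_along(words),
                             stringsAsFactors = FALSE)
stop_demo <- c("the", "was", "it")   # illustrative stopword list, not tm's
all_words_demo <- all_words_demo[!all_words_demo$word %in% stop_demo, ]

print(all_words_demo)
```

The surviving words keep positions 2, 4, 6 and 7, so "fire" and "red" are still two tokens apart even though "was" has been removed.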

You can then find just the word fire, and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. You can then use group_by() and summarize() to get your desired statistics for each word.
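Conceptually, the distance join keeps every pair of rows whose positions differ by at most max_dist. A rough base R sketch of that idea, using a cross join followed by a filter, might look like this. The toy data, the window of 5, and the *_demo names are assumptions chosen for illustration; this is not how fuzzyjoin is implemented internally.

```r
# Hypothetical toy data: one row per word, with its position in the text.
all_words_demo <- data.frame(
  word     = c("fire", "red", "flaming", "spread", "trees", "fire"),
  position = c(2L, 4L, 6L, 7L, 9L, 20L),
  stringsAsFactors = FALSE
)

# Keep only the focus term and rename its columns.
focus <- all_words_demo[all_words_demo$word == "fire", ]
names(focus) <- c("focus_term", "focus_position")

# Cross join every focus row with every word row, then keep the close pairs.
pairs <- merge(focus, all_words_demo, by = NULL)
pairs$distance <- abs(pairs$focus_position - pairs$position)
nearby_demo <- pairs[pairs$distance <= 5, ]   # window of 5 suits the toy data

# Per-word statistics, mirroring the group_by() and summarize() step.
by_word <- split(nearby_demo$distance, nearby_demo$word)
words_summarized_demo <- data.frame(
  word             = names(by_word),
  number           = lengths(by_word),
  minimum_distance = sapply(by_word, min),
  maximum_distance = sapply(by_word, max),
  average_distance = sapply(by_word, mean),
  row.names        = NULL
)

print(words_summarized_demo)
```

Note that each focus row also matches itself at distance 0, which is why the focus term shows up in its own summary. The fuzzyjoin version of the same steps: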

library(fuzzyjoin)

nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))

Output in this case:

# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
 1     fire      3                0                0              0.0
 2    light      2               12                7              9.5
 3     moon      2               13                9             11.0
 4    bells      1               14               14             14.0
 5  beneath      1               11               11             11.0
 6   blazed      1               10               10             10.0
 7   crowns      1                5                5              5.0
 8     dale      1               15               15             15.0
 9   dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows

Note that this approach also lets you perform the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")), and change group_by(word) to group_by(focus_term, word), since focus_term is the column that holds the focus word after the select() step.
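As a sketch of the multi-word variant, the pipeline could be written as below. The toy table all_words_toy, the second focus word "light", the window of 5, and the .groups = "drop" argument (which assumes dplyr 1.0 or later) are my own choices for a small runnable example, not part of the original code.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy data standing in for all_words (word plus original position).
all_words_toy <- tibble(
  word     = c("fire", "red", "light", "moon", "fire"),
  position = c(2, 4, 8, 12, 20)
)

summary_multi <- all_words_toy %>%
  # keep rows for every focus word instead of a single one
  filter(word %in% c("fire", "light")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words_toy, by = c(focus_position = "position"), max_dist = 5) %>%
  mutate(distance = abs(focus_position - position)) %>%
  # one summary row per (focus word, nearby word) pair
  group_by(focus_term, word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance),
            .groups = "drop") %>%
  arrange(focus_term, desc(number))

print(summary_multi)
```

Grouping by focus_term as well as word keeps the statistics for each focus word separate, so "red" near "fire" and "red" near "light" are reported as two different rows.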

The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function to count within a window is not kwic() but rather fcm() (feature co-occurrence matrix).

require(quanteda)

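# tokenize so that punctuation is removed and words are split at intra-word hyphens
# (quanteda >= 2.0 calls this argument split_hyphens; older versions used remove_hyphens)
toks <- tokens(song, remove_punct = TRUE, split_hyphens = TRUE)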

# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features    fire
##   Far          1
##   over         1
##   the          5
##   misty        1
##   mountains    0
##   cold         0

head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##         features
## features fire
##    light    2

Getting the average distance of the words from the target requires a bit of a hack using the weights argument. Below, weights = 1:15 weights each co-occurrence by its distance from the target, so summing the weighted counts and dividing by the plain frequency within the window gives the mean distance. For your example of "light", for instance:

# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
    fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##         features
## features fire
##    light  9.5
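
To spell out why that ratio is a mean distance: with weights = 1:15, a co-occurrence at distance d contributes d to the weighted count and 1 to the frequency count. Writing n_d for the number of times "light" occurs exactly d tokens from "fire" (my notation, not quanteda's), and taking the two distances 7 and 12 from the minimum and maximum reported for "light" in the tidytext output, on the assumption that both approaches count tokens the same way:

```latex
\text{average distance}
  = \frac{\sum_{d=1}^{15} d \cdot n_d}{\sum_{d=1}^{15} n_d}
  = \frac{7 + 12}{2}
  = 9.5
```

This matches the 9.5 returned by both approaches.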

Getting the minimum and maximum position is a bit more complicated. I can figure out a way to "hack" this by using the weights to place a binary mask at each position and then converting that to a distance, but it is too ungainly to show, so I'm recommending the tidy solution unless I think of a more elegant way.
