quanteda

How to compute similarity in quanteda between documents for adjacent years only, within groups?

大兔子大兔子 submitted on 2021-02-11 06:17:46

Question: I have a diachronic corpus with texts from different organizations, each covering the years 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, the text for 1970 with the text for 1971, and so on. Texts for some years are missing. In other words, I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:

ncsimil <- textstat_simil(dfm.cc, y = NULL, selection = NULL, margin = "documents", method = "jaccard", min_simil = NULL)

This compares every text with every other text, …
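One way to restrict the result to adjacent-year pairs within each organization, sketched below in base R: convert the pairwise similarity result to a long data frame (a textstat_simil object supports as.data.frame(), giving document1/document2 columns), attach the organization and year docvars to both sides of each pair, and keep only same-organization pairs whose years differ by one. The meta table, the doc/org/year column names, and the underscore naming scheme are illustrative assumptions, not part of the original question.

```r
# Hypothetical docvars: each document belongs to an organization and a year.
meta <- data.frame(
  doc  = c("A_1969", "A_1970", "A_1972", "B_1970", "B_1971"),
  org  = c("A", "A", "A", "B", "B"),
  year = c(1969, 1970, 1972, 1970, 1971),
  stringsAsFactors = FALSE
)

# Full pairwise table, standing in for as.data.frame(textstat_simil(...)).
pairs <- expand.grid(document1 = meta$doc, document2 = meta$doc,
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$document1 < pairs$document2, ]  # each pair once

# Attach org/year for both sides of each pair.
pairs <- merge(pairs, setNames(meta, c("document1", "org1", "year1")))
pairs <- merge(pairs, setNames(meta, c("document2", "org2", "year2")))

# Keep only same-organization pairs whose years are adjacent.
adjacent <- pairs[pairs$org1 == pairs$org2 &
                  abs(pairs$year1 - pairs$year2) == 1, ]
```

Note that A_1970 and A_1972 are dropped as a pair: a missing year (1971) breaks adjacency rather than silently comparing across the gap.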

Feature selection in document-feature matrix by using chi-squared test

送分小仙女□ submitted on 2021-02-06 12:50:43

Question: I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know many people have already asked this question, but I couldn't find the relevant code for it. (The answers only gave a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r) I learned that I could use …
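For context, the chi-squared feature score is just the test statistic of the 2×2 table crossing term presence with class membership, so it can be computed per feature with base R alone. The chi2_score function below is my own sketch, not a quanteda API; within quanteda itself, textstat_keyness(dfm, measure = "chi2") computes a comparable per-feature chi-squared score against a target group.

```r
# Chi-squared score for one feature against a binary class, from the
# 2x2 contingency table of (has term) x (class membership).
chi2_score <- function(n11, n10, n01, n00) {
  # n11: class-A docs containing the term, n10: class-A docs without it,
  # n01/n00: the same counts for class B.
  tab <- matrix(c(n11, n10, n01, n00), nrow = 2)
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

score <- chi2_score(40, 10, 20, 30)
```

Ranking features by this score (largest first) and keeping the top k is the usual selection rule; correct = FALSE gives the plain Pearson statistic rather than the continuity-corrected one.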

Keep the term frequency and inverse document frequency for one type of document

匆匆过客 submitted on 2021-01-29 20:47:08

Question: Code example to compute term frequency and inverse document frequency:

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

book_words <- book_words %>%
  bind_tf_idf(word, book, n)

book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

My problem is that this example uses multiple books. I have …
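Reproducing the tf-idf computation by hand makes the single-document-type problem visible: with only one "book" (one document group), every term's idf is log(1/1) = 0, so all tf-idf scores vanish. A minimal base-R sketch, using the same natural-log idf definition as bind_tf_idf (the tf_idf function name and the toy counts matrix are my own):

```r
# Manual tf-idf over a documents-by-terms count matrix.
tf_idf <- function(counts) {
  tf  <- counts / rowSums(counts)   # term frequency within each document
  df  <- colSums(counts > 0)        # number of documents containing each term
  idf <- log(nrow(counts) / df)     # natural-log idf, as bind_tf_idf uses
  sweep(tf, 2, idf, `*`)            # multiply each term column by its idf
}

counts <- rbind(doc1 = c(2, 1, 0),
                doc2 = c(0, 1, 3))
res <- tf_idf(counts)
```

In res, the middle term scores 0 in both rows because it appears in every document (idf = 0); with a single-row counts matrix, every column would behave that way, which is why a corpus of one document type needs a different weighting or a comparison corpus.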

How to convert DFM into dataframe BUT keeping docvars?

青春壹個敷衍的年華 submitted on 2021-01-28 22:16:32

Question: I am using the quanteda package, and the very good tutorials written about it, to perform various operations on newspaper articles. I obtained the frequency of specific words over time by selecting them in a mainwordsDFM and using textstat_frequency(mainwordsDFM, group = "Date"), then converted the result into a dataframe and plotted it with ggplot. However, I now want to plot the frequency of a word over time and by paper. The solution I used for my previous operation does not work in …
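One workaround for grouping by two docvars at once, assuming the combined grouping was produced by pasting the two docvars into a single label: split the combined group back into separate Date and paper columns after converting to a data frame. The freq table below is an illustrative stand-in for such output, not verified quanteda results, and the "Date.paper" label format is my own assumption.

```r
# Hypothetical frequency table, as a grouped textstat_frequency result
# might look after grouping by a pasted "Date.paper" label.
freq <- data.frame(
  feature   = c("economy", "economy", "economy"),
  frequency = c(5, 3, 7),
  group     = c("2020-01.TimesA", "2020-01.TimesB", "2020-02.TimesA"),
  stringsAsFactors = FALSE
)

# Split the combined group label back into Date and paper columns,
# which ggplot can then use for faceting or colour.
parts <- do.call(rbind, strsplit(freq$group, ".", fixed = TRUE))
freq$Date  <- parts[, 1]
freq$paper <- parts[, 2]
```

With Date and paper as ordinary columns, a ggplot call can map Date to the x axis and paper to colour or facets.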

Is it possible to use `kwic` function to find words near to each other?

你说的曾经没有我的故事 submitted on 2021-01-28 08:24:46

Question: I found this reference: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch05s07.html Is it possible to use it with the kwic function in the quanteda package to find documents in a corpus containing words that are not "stuck together" but close to each other, perhaps with a few other words between them? For example, if I give two words to the function, I would like to find the documents in a corpus where these two words occur, but maybe with some words between …
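The cookbook's proximity pattern can be applied directly to the document texts with base R. Note that kwic matches its pattern token by token, so a multi-token proximity regex cannot be handed to kwic as-is; pre-filtering the texts is one workaround. The near function below is a sketch of mine, not a quanteda function:

```r
# TRUE for each text where w1 is followed by w2 with at most max_gap
# intervening words, using the regex-cookbook proximity pattern.
near <- function(texts, w1, w2, max_gap = 3) {
  pat <- sprintf("\\b%s\\W+(?:\\w+\\W+){0,%d}%s\\b", w1, max_gap, w2)
  grepl(pat, texts, perl = TRUE)
}

docs <- c("tax cuts and major reform",  # 3 words between: matches
          "tax reform",                 # adjacent: matches
          "reform before the tax")      # wrong order: no match
hits <- near(docs, "tax", "reform")
```

Documents flagged by near() could then be passed to kwic on the individual words to inspect the surrounding context.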

Keyword in context (kwic) for skipgrams?

蓝咒 submitted on 2020-12-12 02:07:06

Question: I do keyword-in-context analysis with quanteda for ngrams and tokens, and it works well. I now want to do it for skipgrams, to capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows"

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2, …
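A likely cause of the empty result: the pattern string ("barriers entry") was tokenized and skip-grammed instead of the corpus. Skip-grams have to be generated from the document tokens, so that "barriers_entry" becomes a single token which a kwic pattern can then match. This standalone helper mimics what tokens_ngrams(x, n = 2, skip = 0:1) produces for one document; skipgrams2 is my own sketch, not part of quanteda.

```r
# All ordered token pairs whose positions differ by 1 .. (skip + 1),
# joined with "_" as quanteda does for ngrams.
skipgrams2 <- function(toks, skip = 1) {
  out <- character(0)
  for (i in seq_along(toks)) {
    for (k in 1:(skip + 1)) {
      j <- i + k
      if (j <= length(toks))
        out <- c(out, paste(toks[i], toks[j], sep = "_"))
    }
  }
  out
}

grams <- skipgrams2(c("barriers", "to", "entry"), skip = 1)
```

Applied to the document tokens, "barriers to entry" yields barriers_to, barriers_entry, and to_entry, so barriers_entry exists as a matchable token; applied to the two-word pattern "barriers entry", no such token is ever created, which is consistent with the "kwic object with 0 rows" result.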
