corpus

How to compute similarity in quanteda between documents for adjacent years only, within groups?

大兔子大兔子 submitted on 2021-02-11 06:17:46
Question: I have a diachronic corpus with texts from different organizations, one per year from 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, the text for 1970 with the text for 1971, and so on. Texts for some years are missing. In other words, I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:

ncsimil <- textstat_simil(dfm.cc, y = NULL, selection = NULL, margin = "documents", method = "jaccard", min_simil = NULL)

This compares every text with every other text, …
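
One way to keep only the adjacent-year, same-organization pairs is to compute the full similarity anyway, convert it to a long data frame, and filter. A minimal sketch, assuming the corpus carries docvars named org and year (hypothetical names; substitute whatever the real docvars are called):

library(quanteda)
library(quanteda.textstats)

# Full pairwise Jaccard similarity between documents
sim <- textstat_simil(dfm.cc, margin = "documents", method = "jaccard")

# Long format: one row per document pair
sim_df <- as.data.frame(sim)

# Attach each document's organization and year from the docvars
# ("org" and "year" are assumed names, not from the question)
meta <- data.frame(doc  = docnames(dfm.cc),
                   org  = docvars(dfm.cc, "org"),
                   year = docvars(dfm.cc, "year"))
sim_df <- merge(sim_df, meta, by.x = "document1", by.y = "doc")
sim_df <- merge(sim_df, meta, by.x = "document2", by.y = "doc",
                suffixes = c("1", "2"))

# Same organization, exactly one year apart; pairs that span a
# missing year never satisfy the condition and drop out naturally
adjacent <- subset(sim_df, org1 == org2 & abs(year1 - year2) == 1)

Computing the full matrix first is wasteful for a very large corpus, but for roughly fifty documents per organization it keeps the code simple.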

How to Extract keywords from a Data Frame in R

不打扰是莪最后的温柔 submitted on 2021-02-08 06:33:35
Question: I am new to text mining in R. I want to remove stopwords from (i.e. extract the keywords from) a column of my data frame and put those keywords into a new column. I tried making a corpus, but it didn't help me. df$C3 is what I currently have. I would like to add a column df$C4, but I can't get it to work.

df <- structure(list(C3 = structure(c(3L, 4L, 1L, 7L, 6L, 9L, 5L, 8L, 10L, 2L), .Label = c("Are doing good", "For the help", "hello everyone", "hope you all", "I Hope", "I need help", "In life", "It …
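
A corpus is not strictly needed for this: tm's removeWords operates directly on a character vector. A small sketch, assuming the standard English stopword list is what should be removed (column names C3 and C4 from the question):

library(tm)

# Strip English stopwords from each row of C3; lower-case first because
# removeWords matches case-sensitively, then tidy the leftover whitespace
df$C4 <- removeWords(tolower(df$C3), stopwords("en"))
df$C4 <- trimws(gsub("\\s+", " ", df$C4))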

Combining/adding vectors from different word2vec models

吃可爱长大的小学妹 submitted on 2020-06-17 03:53:05
Question: I am using gensim to create Word2Vec models trained on large text corpora. I have some models based on StackExchange data dumps, and I also have a model trained on a corpus derived from English Wikipedia. Assume that a vocabulary term is in both models and that the models were created with the same parameters to Word2Vec. Is there any way to combine or add the vectors from the two separate models to create a single new model that has the same word vectors that would have resulted if I had …

CWB encoding Corpus

霸气de小男生 submitted on 2020-01-14 03:48:18
Question: According to the Corpus Workbench, to encode a corpus I need to use the cwb-encode tool: "encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line." http://cogsci.uni-osnabrueck.de/~korpora/ws/CWBdoc/CWB_Encoding_Tutorial/node3.html

$ cwb-encode -d /corpora/data/example -f example.vrt -R /usr/local/share/cwb/registry/example -P pos -S s

When I tried it, it said the file was missing …

Can you recommend a source of reference data for Fundamental matrix calculation

白昼怎懂夜的黑 submitted on 2020-01-06 14:08:39
Question: Specifically, I would ideally like images with point correspondences and a 'gold standard' computed value of F, together with the left and right epipoles. I could also work with an essential matrix plus intrinsic and extrinsic camera properties. I know that I can construct F from two projection matrices, generate left and right projected point coordinates from actual 3D points, and apply Gaussian noise, but I would really like to work with someone else's reference data, since I am trying to test the efficacy of …

Creating more complex regexes from TAG format

无人久伴 submitted on 2020-01-06 14:07:57
Question: I can't figure out what's wrong with my regex here. (The original conversation, which includes an explanation of these TAG formats, can be found here: Translate from TAG format to Regex for Corpus.) I am starting with a string like this:

Arms_NNS folded_VVN ,_,

The NNS could also be NN, and the VVN could also be VBG. I just want to find that string and other strings with the same tags (NNS or NN, followed by VVN or VBG, followed by a comma). The following regex is what I am trying to use, but it is …
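
For comparison, here is one regex that matches the pattern as described, sketched with R's grepl (the word_TAG tokens separated by single spaces are assumed from the example string):

# a word tagged NNS or NN, then a word tagged VVN or VBG, then the comma token
pat <- "[A-Za-z]+_(?:NNS|NN)\\s+[A-Za-z]+_(?:VVN|VBG)\\s+,_,"

grepl(pat, "Arms_NNS folded_VVN ,_,", perl = TRUE)  # TRUE
grepl(pat, "Arm_NN folding_VBG ,_,", perl = TRUE)   # TRUE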

Find frequency of a custom word in R TermDocumentMatrix using TM package

喜欢而已 submitted on 2020-01-05 04:28:10
Question: I turned about 50,000 rows of varchar data into a corpus and then cleaned that corpus with the TM package, getting rid of stopwords, punctuation, and numbers. I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most common words and the number of times each appears in the data. However, I want to use a function that searches for "word" and returns how many times "word" appears in …
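
One way to look up a single term is to index the TermDocumentMatrix by its row name and sum across documents. A short sketch, assuming the matrix is called tdm and that terms were lower-cased during cleaning (both assumptions):

library(tm)

# Total number of times `word` appears across all documents
term_count <- function(tdm, word) {
  if (!word %in% Terms(tdm)) return(0)  # term absent after cleaning
  sum(as.matrix(tdm[word, ]))           # rows are terms, columns are documents
}

term_count(tdm, "help")  # count of the (hypothetical) term "help"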

Treat words separated by space in the same manner

送分小仙女□ submitted on 2020-01-03 07:30:21
Question: I am trying to find words that occur in multiple documents at the same time. Take an example:

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

As you can see in the two documents above, the word "milkyway" occurs in both, but in the second document it is separated by a space and in the first it is not. I am doing the following to get the document-term matrix in R:

library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus …
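
One possible workaround, sketched with quanteda rather than tm (a deliberate substitution, not the asker's code): index each document on unigrams plus bigrams joined without a separator, so that "milky way" also produces the compound token "milkyway".

library(quanteda)

docs <- c(doc1 = "this is a document about milkyway",
          doc2 = "milky way is huge")

toks <- tokens(docs)
# Add bigrams joined with no separator: "milky" + "way" -> "milkyway"
toks <- tokens_ngrams(toks, n = 1:2, concatenator = "")

m <- dfm(toks)
m[, "milkyway"]  # the feature now appears in both documents

The trade-off is a much larger feature space, since every adjacent word pair becomes a candidate token.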