text-mining

R: how to construct a document-term matrix that matches dictionaries whose values consist of white-space-separated phrases

好久不见. Submitted on 2019-12-01 12:03:05
Question: When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. But as in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they take on totally different meanings. I want to know how to match pre-defined dictionaries whose values consist of white-space-separated terms, such as "semantic distance" and "machine learning". If a document is "we could
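The question concerns R's tm package, but the matching logic can be sketched in a few lines of Python (an illustrative stand-in, not the tm API): scan each document's token stream for the dictionary's multi-word phrases, so that "semantic distance" is counted as one term rather than two unrelated words.

```python
def count_phrases(doc, phrases):
    """Count each white-space-separated dictionary phrase as a single term."""
    tokens = doc.lower().split()
    counts = {p: 0 for p in phrases}
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        # slide a window of the phrase's length over the token stream
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                counts[phrase] += 1
    return counts

doc = "we study semantic distance and machine learning methods"
print(count_phrases(doc, ["semantic distance", "machine learning"]))
# → {'semantic distance': 1, 'machine learning': 1}
```

These per-document counts then become one column per phrase in the document-term matrix.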

big document-term matrix: error when counting the number of characters in documents

∥☆過路亽.° Submitted on 2019-12-01 12:02:07
Question: I have built a big document-term matrix with the package RTextTools. Now I am trying to count the number of characters in the matrix rows so that I can remove empty documents before performing topic modeling. My code gives no errors when I apply it to a sample of my corpus, obtaining a smaller matrix, but when I try to count the row lengths of the documents in the matrix produced from my entire corpus (~75,000 tweets) I get the following error message: Error in vector(typeof(x$v), nr * nc) :
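The error suggests R is trying to allocate a dense vector of size nr * nc, which is hopeless for a corpus-scale matrix. One workaround, sketched here in Python rather than R (illustrative only), is to sum the counts per document directly from the sparse triplet representation (i, j, v) and keep only non-empty rows, never densifying:

```python
def nonempty_rows(i_idx, values, n_docs):
    """Sum sparse values per row and return indices of rows whose total > 0."""
    totals = [0] * n_docs
    for i, v in zip(i_idx, values):  # one pass over the nonzero entries
        totals[i] += v
    return [r for r, t in enumerate(totals) if t > 0]

# triplet form: row (document) index and count for each nonzero entry
i_idx = [0, 0, 2, 3]
values = [2, 1, 4, 1]
print(nonempty_rows(i_idx, values, 4))  # → [0, 2, 3]  (row 1 is empty)
```

In R, the analogous move is to use slam::row_sums on the simple_triplet_matrix instead of coercing it to a dense matrix.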

CPU- and memory-efficient n-gram extraction with R

雨燕双飞 Submitted on 2019-12-01 11:20:26
I wrote an algorithm which extracts n-grams (bigrams, trigrams, ..., up to 5-grams) from a list of 50,000 street addresses. My goal is to have, for each address, a boolean vector representing whether each n-gram is present in the address. Each address will therefore be characterized by a vector of attributes, and then I can carry out clustering on the addresses. The algorithm works this way: I start with the bigrams and calculate all the combinations of (a-z, 0-9, /, and tabulation), for example: aa, ab, ac, ..., a8, a9, a/, a , ba, bb, ... Then I loop over each address and extract for
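Enumerating every possible combination up front (38^2 up to 38^5 strings) is what makes the approach CPU- and memory-hungry, since almost none of those n-grams ever occur. A cheaper alternative, sketched in Python (the question uses R, so this only illustrates the idea), is to extract just the n-grams actually present in each address and build the vocabulary as their union:

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Extract the set of character n-grams actually present in text."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def boolean_vector(address, vocab):
    """One boolean per vocabulary n-gram: present in the address or not."""
    grams = char_ngrams(address)
    return [g in grams for g in vocab]

# vocabulary = union of observed n-grams, not all 38^n combinations
addresses = ["12 main st", "14 main rd"]
vocab = sorted(set().union(*(char_ngrams(a) for a in addresses)))
print(boolean_vector("12 main st", vocab[:5]))
```

The vocabulary size is then bounded by the total number of characters in the data, not by 38^5.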

all possible wordform completions of a (biomedical) word's stem

自作多情 Submitted on 2019-12-01 09:12:55
I'm familiar with word stemming and completion from the tm package in R. I'm trying to come up with a quick-and-dirty method for finding all variants of a given word (within some corpus). For example, I'd like to get "leukocytes" and "leukocytic" if my input is "leukocyte". If I had to do it right now, I would probably just go with something like: library(tm) library(RWeka) dictionary <- unique(unlist(lapply(crude, words))) grep(pattern = LovinsStemmer("company"), ignore.case = T, x = dictionary, value = T) I used Lovins because Snowball's Porter doesn't seem to be aggressive enough. I'm open
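The idea in the R snippet - stem every dictionary word and keep those whose stem matches the query's stem - can be sketched in Python with a deliberately crude, aggressive stemmer. The suffix list below is invented for illustration; it is not the Lovins algorithm, just a stand-in showing the grouping logic:

```python
# Illustrative suffix list only; a real solution would use Lovins or Porter.
SUFFIXES = ["ytic", "ytes", "yte", "ic", "es", "s"]

def crude_stem(word):
    """Strip the longest matching suffix, keeping at least 4 leading chars."""
    w = word.lower()
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if w.endswith(suf) and len(w) > len(suf) + 3:
            return w[:-len(suf)]
    return w

def variants(query, dictionary):
    """All dictionary words sharing the query's (crude) stem."""
    stem = crude_stem(query)
    return sorted(w for w in dictionary if crude_stem(w) == stem)

words = ["leukocytes", "leukocytic", "leukocyte", "company"]
print(variants("leukocyte", words))
# → ['leukocyte', 'leukocytes', 'leukocytic']
```

The key property, as the asker notes, is that the stemmer must be aggressive enough to conflate "leukocytic" with "leukocyte".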

clustering a list of words in Python

最后都变了- Submitted on 2019-12-01 08:37:13
Question: I am a newbie in text mining; here is my situation. Suppose I have a list of words ['car', 'dog', 'puppy', 'vehicle'] and I would like to cluster the words into k groups; I want the output to be [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate the similarity score of each pairwise word to obtain a 4x4 matrix (in this case) M, where Mij is the similarity score of words i and j. After transforming the words into numeric data, I utilize a clustering library (such as sklearn) or implement it
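One simple way to go from the pairwise similarity matrix M to groups, sketched here with only the standard library, is to threshold the similarities and take connected components via union-find. (For a fixed number k of clusters one would instead feed the distances 1 - M to a library clusterer such as sklearn's AgglomerativeClustering with a precomputed metric; the threshold 0.5 below is an arbitrary illustrative choice.)

```python
def cluster_by_threshold(words, sim, threshold=0.5):
    """Group words whose pairwise similarity >= threshold
    (connected components via union-find)."""
    parent = list(range(len(words)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i, w in enumerate(words):
        groups.setdefault(find(i), []).append(w)
    return sorted(groups.values())

words = ["car", "dog", "puppy", "vehicle"]
sim = [[1.0, 0.1, 0.1, 0.9],
       [0.1, 1.0, 0.8, 0.1],
       [0.1, 0.8, 1.0, 0.1],
       [0.9, 0.1, 0.1, 1.0]]
print(cluster_by_threshold(words, sim))
# → [['car', 'vehicle'], ['dog', 'puppy']]
```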

Cosine similarity of 2 DTMs in R

限于喜欢 Submitted on 2019-12-01 07:26:58
Question: I have 2 document-term matrices: DTM1 has, say, 1000 vectors (1000 docs) and DTM2 has 20 vectors (20 docs). Basically I want to compare each document of DTM1 against DTM2 and see which DTM1 docs are closest to which DTM2 docs using the cosine function. Any pointers would help! I have created a cosine matrix using the "slam" package. Docs –glyma –ie –initi –stafford ‘bureaucratic’ ‘empti ‘holi ‘incontrovert 1 0.000000 0 0.000000 0.000000 0.000000 0 0 0 2 0.000000 0 0.000000 0
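The comparison itself is straightforward once both DTMs share the same term columns: compute the cosine between every row of DTM1 and every row of DTM2, giving a 1000 x 20 similarity matrix where entry (i, j) says how close DTM1 doc i is to DTM2 doc j. A minimal Python sketch of that computation (the question uses R's slam package; this just shows the math on toy data):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cross_cosine(dtm1, dtm2):
    """similarity[i][j] = cosine(doc i of dtm1, doc j of dtm2);
    both matrices must use the same term columns."""
    return [[cosine(d1, d2) for d2 in dtm2] for d1 in dtm1]

dtm1 = [[3, 4], [4, 0]]
dtm2 = [[3, 4]]
print(cross_cosine(dtm1, dtm2))  # → [[1.0], [0.6]]
```

For each DTM1 document, the closest DTM2 document is then the argmax of its row.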

In R tm package, build corpus FROM Document-Term-Matrix

断了今生、忘了曾经 Submitted on 2019-12-01 06:35:30
It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term matrix. Let M be the number of documents in a document set and V the number of terms in the vocabulary of that document set. Then a document-term matrix is an M*V matrix. I also have a vocabulary vector of length V; the vocabulary vector holds the words represented by the indices in the document-term matrix. From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab
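One way to get from a DTM plus vocabulary vector back to something corpus-like is to rebuild a bag-of-words pseudo-document per row, repeating each term by its count. Word order is lost, but for stemming that usually does not matter. A sketch in Python (in R one would then wrap the resulting strings with tm's Corpus(VectorSource(...))):

```python
def corpus_from_dtm(dtm, vocab):
    """Rebuild one pseudo-document per DTM row: each vocabulary term
    repeated by its count. Word order is lost by construction."""
    docs = []
    for row in dtm:
        words = []
        for j, count in enumerate(row):
            words.extend([vocab[j]] * count)
        docs.append(" ".join(words))
    return docs

vocab = ["apple", "banana", "cherry"]
dtm = [[2, 0, 1],   # doc 1: apple x2, cherry x1
       [0, 1, 1]]   # doc 2: banana, cherry
print(corpus_from_dtm(dtm, vocab))
# → ['apple apple cherry', 'banana cherry']
```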


sentiment analysis - WordNet, SentiWordNet lexicons

為{幸葍}努か Submitted on 2019-12-01 05:17:19
I need a list of positive and negative words with weights assigned according to how strong or weak they are. I have got: 1.) WordNet - it gives a + or - score for every word. 2.) SentiWordNet - giving positive and negative values in the range [0,1]. I checked these on a few words: love - WordNet gives 0.0 for both noun and verb; I don't know why, as I think it should be positive by at least some factor. repress - WordNet gives -9.93; SentiWordNet gives 0.0 for both pos and neg (should be negative). repose - WordNet - 2.488; SentiWordNet - { pos - 0.125, neg - 0.5 } (should be
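SentiWordNet assigns each entry separate positive and negative scores in [0, 1], and a common way to collapse them into one signed weight is pos - neg. A minimal sketch with a hand-made toy lexicon (the scores below are invented for illustration and are not real SentiWordNet values):

```python
# Toy SentiWordNet-style lexicon: word -> (positive, negative) in [0, 1].
# All scores here are made up for illustration only.
LEXICON = {
    "love":    (0.875, 0.0),
    "repress": (0.0,   0.625),
    "repose":  (0.125, 0.5),
}

def polarity(word):
    """Signed weight: positive score minus negative score; 0 if unknown."""
    pos, neg = LEXICON.get(word.lower(), (0.0, 0.0))
    return pos - neg

def sentence_score(sentence):
    """Naive sum of word polarities over a whitespace-tokenized sentence."""
    return sum(polarity(w) for w in sentence.lower().split())

print(polarity("love"))      # → 0.875
print(polarity("repress"))   # → -0.625
```

Real SentiWordNet scores attach to synsets rather than surface words, so a full solution also needs sense disambiguation or an average over a word's synsets.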