text-mining

Finding ngrams in R and comparing ngrams across corpora

杀马特。学长 韩版系。学妹 submitted on 2019-11-29 07:58:41
Question: I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-part question, one regarding my code so far and one regarding how I should go on. Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early
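A common way to get n-gram counts out of a tm corpus is to plug an n-gram tokenizer into the term-document matrix. A minimal sketch, assuming a preprocessed corpus already exists (the documents and the RWeka-based tokenizer below are illustrative, not the asker's data):

library(tm)
library(RWeka)

docs <- c("struggle criticism transformation movement",
          "the great leap forward movement")
corpus <- VCorpus(VectorSource(docs))

# tokenizer that emits 2- and 3-grams instead of single words
ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer))
sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # n-gram frequencies across the corpus

From there, findFreqTerms(tdm, lowfreq = 5) would list the n-grams occurring at least five times, which is a reasonable starting point for spotting coined terms.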

How to break conversation data into pairs of (Context , Response)

孤街醉人 submitted on 2019-11-29 07:01:33
Question: I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions. Figure 1 shows a sample conversation where the user's question is answered in the next conversation line, making it easy to extract the data: during the conversation, "hello" and "Our offices are located in NYC" should be suggested. Figure 2 describes a conversation where the questions and answers are not in sync during the

Math of tm::findAssocs: how does this function work?

你离开我真会死。 submitted on 2019-11-29 05:17:46
I have been using findAssocs() for text mining (the tm package) but realized that something doesn't seem right with my dataset. My dataset is 1500 open-ended answers saved in one column of a CSV file. So I called the dataset like this and used the typical tm_map calls to turn it into a corpus:

library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
findAssocs(dtm, "like", .2
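For reference, the scores findAssocs() returns correspond to the Pearson correlation between the count columns of the document-term matrix. A minimal sketch with made-up documents showing the equivalence:

library(tm)

docs <- c("I like this game", "I like a fun game",
          "puzzles are boring", "this movie is boring")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

findAssocs(dtm, "like", 0.5)     # terms whose correlation with "like" exceeds 0.5

m <- as.matrix(dtm)
cor(m[, "like"], m[, "game"])    # the same score for "game", computed by hand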

build word co-occurence edge list in R

流过昼夜 submitted on 2019-11-29 04:50:33
I have a chunk of sentences and I want to build the undirected edge list of word co-occurrences and see the frequency of every edge. I took a look at the tm package but didn't find similar functions. Is there some package/script I can use? Thanks a lot! Note: a word doesn't co-occur with itself, and a word which appears twice or more in the same sentence co-occurs with other words only once.

DF:
sentence_id  text
1            a b c d e
2            a b b e
3            b c d
4            a e
5            a
6            a a a

OUTPUT:
word1  word2  freq
a      b      2
a      c      1
a      d      1
a      e      3
b      c      2
b      d      2
b      e      2
c      d      2
c      e      1
d      e      1

It's convoluted so there's got to be a better
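One base-R way to get exactly that output, with no extra packages: take the unique words in each sentence, enumerate the unordered pairs, and tabulate them. A minimal sketch using the example data above:

df <- data.frame(sentence_id = 1:6,
                 text = c("a b c d e", "a b b e", "b c d", "a e", "a", "a a a"),
                 stringsAsFactors = FALSE)

pairs <- do.call(rbind, lapply(df$text, function(s) {
  w <- sort(unique(strsplit(s, " ")[[1]]))   # each word counted once per sentence
  if (length(w) < 2) return(NULL)            # a word doesn't co-occur with itself
  t(combn(w, 2))                             # all unordered word pairs
}))

edges <- as.data.frame(table(word1 = pairs[, 1], word2 = pairs[, 2]))
edges[edges$Freq > 0, ]                      # matches the OUTPUT table above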

Use scikit-learn TfIdf with gensim LDA

亡梦爱人 submitted on 2019-11-29 04:27:01
I've used various versions of TFIDF in scikit-learn to model some text data.

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')

The resulting data X is in this format:

<rowsxcolumns sparse matrix of type '<type 'numpy.float64'>' with xyz stored elements in Compressed Sparse Row format>

I wanted to experiment with LDA as a way to reduce the dimensionality of my sparse matrix. Is there a simple way to feed the sparse matrix X into a gensim LDA model?

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)

I can ignore scikit and go the way the gensim

How to recreate same DocumentTermMatrix with new (test) data

谁说胖子不能爱 submitted on 2019-11-29 04:11:12
Question: Suppose I have text-based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand. I used the tm package in R to process the text column in the training data set. After removing white space, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix of 1-grams containing the frequency/count of the words in each document. I
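The usual trick is to build the test matrix against the training vocabulary via the dictionary control option, so the two matrices share the same columns. A minimal sketch, assuming train_corpus and test_corpus are already preprocessed tm corpora (both names are placeholders):

library(tm)

train_dtm <- DocumentTermMatrix(train_corpus)

# restrict the test matrix to terms seen in training
test_dtm <- DocumentTermMatrix(test_corpus,
                               control = list(dictionary = Terms(train_dtm)))

Terms(test_dtm)   # drawn from the training vocabulary, so a model fit on
                  # train_dtm can score test_dtm directly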

How to access Wikipedia from R?

♀尐吖头ヾ submitted on 2019-11-29 03:16:29
Question: Is there any package for R that allows querying Wikipedia (most probably using the MediaWiki API) to get a list of available articles relevant to a query, as well as import selected articles for text mining?

Answer 1: Use the RCurl package for retrieving info, and the XML or RJSONIO packages for parsing the response. If you are behind a proxy, set your options:

opts <- list(
  proxy = "136.233.91.120",
  proxyusername = "mydomain\\myusername",
  proxypassword = 'whatever',
  proxyport = 8080
)

Use the
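Along the lines of that answer, a minimal sketch of a search query against the MediaWiki API with RCurl and RJSONIO (the search term is illustrative; the proxy list above can be passed via getURL's .opts argument if needed):

library(RCurl)
library(RJSONIO)

query <- URLencode("text mining")
search_url <- paste0("https://en.wikipedia.org/w/api.php",
                     "?action=query&list=search&format=json&srsearch=", query)

res <- fromJSON(getURL(search_url))
sapply(res$query$search, `[[`, "title")   # titles of matching articles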

Methods for extracting locations from text?

≡放荡痞女 submitted on 2019-11-29 02:57:36
Question: What are the recommended methods for extracting locations from free text? What I can think of is to use regex rules like "words ... in location". But are there better approaches than this? Also, I can think of having a lookup hash table with names of countries and cities and then comparing every extracted token from the text against that hash table. Does anybody know of better approaches? Edit: I'm trying to extract locations from tweet text, so the issue of the high number of tweets might
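The lookup-table idea can be prototyped in R against the world.cities gazetteer that ships with the maps package; a naive sketch follows (pre-trained named-entity recognizers, e.g. via the openNLP package, are the usual step up from plain lookups):

library(maps)
data(world.cities)

find_cities <- function(text, gazetteer = unique(world.cities$name)) {
  # naive substring match of every gazetteer entry against the text;
  # fine for a demo, far too slow for large volumes of tweets
  gazetteer[vapply(gazetteer, grepl, logical(1), x = text, fixed = TRUE)]
}

find_cities("Flying from Paris to New York next week")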

How to find ngram frequency of a column in a pandas dataframe?

℡╲_俬逩灬. submitted on 2019-11-28 23:23:46
Question: Below is the input pandas dataframe I have. I want to find the frequency of unigrams & bigrams. A sample of what I am expecting is shown below. How can I do this using nltk or scikit-learn? I wrote the code below, which takes a string as input. How can I extend it to a series/dataframe?

import nltk
from nltk.collocations import *

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from

R text mining documents from CSV file (one row per doc)

心已入冬 submitted on 2019-11-28 18:58:14
I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus, but I want each line to be a different document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following:

fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")

This creates a corpus with 1 document and >10,000 rows, and I want >10,000 docs with 1 row each. I imagine
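VectorSource() creates one document per element of the character vector it is given, so the usual fix is to pass the text column rather than the whole data frame. A minimal sketch, assuming the feedback lives in a column named text (the file and column names here are illustrative):

library(tm)

fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)
fdbk_corpus <- VCorpus(VectorSource(fdbk$text))   # one document per CSV row

length(fdbk_corpus)                      # > 10,000 documents
dtm <- DocumentTermMatrix(fdbk_corpus)   # one row per feedback line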