text-mining

How to find term frequency within a DTM in R?

我的未来我决定 submitted on 2019-12-07 12:44:28
Question: I've been using the tm package to create a Document-Term Matrix as follows:

    library(tm)
    library(RWeka)
    library(SnowballC)
    src <- DataframeSource(data.frame(data3$JobTitle))
    # Create a corpus and transform the data
    # Set the default number of threads to use
    options(mc.cores = 1)
    c_copy <- c <- Corpus(src)
    c <- tm_map(c, content_transformer(tolower), mc.cores = 1)
    c <- tm_map(c, content_transformer(removeNumbers), mc.cores = 1)
    c <- tm_map(c, removeWords, stopwords("english"), mc.cores = 1)
    c <- tm_map(c

Print first line of one element of Corpus in R using tm package

半世苍凉 submitted on 2019-12-07 09:56:28
How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure, so printing just the first line, or first few lines, of the corpus would be ideal.

    # Load libraries
    library(tm)
    # Read in the corpus
    corp <- SimpleCorpus(DirSource("C:/TextDocument"))
    # Remove punctuation
    corp <- removePunctuation(corp,
                              preserve_intra_word_contractions = TRUE,
                              preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways:

    # Print first line of first element of

Explicit Semantic Analysis

若如初见. submitted on 2019-12-07 05:52:12
Question: I came across the term 'Explicit Semantic Analysis', which uses Wikipedia as a reference, finds the similarity between documents, and categorizes them into classes (correct me if I am wrong). The link I came across is here. I wanted to learn more about it. Please help me out with it!

Answer 1: Explicit semantic analysis works along similar lines to semantic similarity. I got hold of this link, which provides a clear example of ESA.

Source: https://stackoverflow.com/questions/8707624/explicit
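The core idea of ESA can be sketched without Wikipedia itself: each text is mapped to a weighted vector over a space of concepts (in real ESA, one dimension per Wikipedia article), and two texts are compared by the cosine of their concept vectors. Below is a minimal Python sketch; the concept names and word weights are invented stand-ins for the tf-idf scores a real ESA system derives from Wikipedia.

```python
from math import sqrt

# Toy "concept space": each concept stands in for a Wikipedia article and
# maps to descriptive words with tf-idf-like weights. All names and weights
# here are invented for illustration only.
CONCEPTS = {
    "Computer":  {"computer": 1.0, "software": 0.8, "machine": 0.5},
    "Biology":   {"cell": 1.0, "organism": 0.9, "machine": 0.1},
    "Economics": {"market": 1.0, "trade": 0.8, "software": 0.2},
}

def esa_vector(words):
    """Map a bag of words to one weight per concept."""
    return {c: sum(w.get(word, 0.0) for word in words)
            for c, w in CONCEPTS.items()}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def esa_similarity(words_a, words_b):
    """Similarity of two texts = cosine of their concept vectors."""
    return cosine(esa_vector(words_a), esa_vector(words_b))

# "computer" and "software" activate the same concept, "cell" does not:
print(esa_similarity(["computer"], ["software"]))  # high (~0.97)
print(esa_similarity(["computer"], ["cell"]))      # 0.0
```

This is why ESA can relate documents that share no literal words: both only have to activate overlapping concepts.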

converting stemmed word to the root word in R

坚强是说给别人听的谎言 submitted on 2019-12-06 20:41:30
Hi, I have a list of words which have been stemmed using the "tm" package in R. Can I get back the root word somehow after this step? Thanks in advance. Example: activiti --> activity

You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following:

    library(tm)
    library(qdap)  # provides the stemmer() function
    active.text = "there are plenty of funny activities"
    active.corp = Corpus(VectorSource(active.text))
    (st.text = tolower(stemmer(active.text, warn = F)))
    # This is what the columns of your Term-Document Matrix are going to look like
    [1]

Calculate similarity between list of words

走远了吗. submitted on 2019-12-06 16:24:08
Question: I want to calculate the similarity between two lists of words. For example:

    ['email','user','this','email','address','customer']

is similar to this list:

    ['email','mail','address','netmail']

I want it to have a higher percentage of similarity than another list, for example:

    ['address','ip','network']

even though 'address' exists in that list.

Answer 1: Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot:

    list_A = ['email','user','this','email','address','customer']
    list
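A simple, thesaurus-free baseline for this is Jaccard similarity on the word sets, which already gives the ordering the question asks for on these example lists. (Counting 'mail' or 'netmail' as matches for 'email' would additionally require synonym expansion, e.g. via WordNet.) A minimal Python sketch:

```python
def jaccard_similarity(list_a, list_b):
    """Set-overlap similarity: |A intersect B| / |A union B|; duplicates ignored."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(jaccard_similarity(list_A, list_B))  # 2/7 ~ 0.286
print(jaccard_similarity(list_A, list_C))  # 1/7 ~ 0.143
```

list_B shares two words with list_A ('email', 'address') out of seven distinct words overall, while list_C shares only 'address', so list_B scores higher.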

Why isn't stemDocument stemming?

走远了吗. submitted on 2019-12-06 16:19:40
I am using the 'tm' package in R to create a term-document matrix using stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox:

    library(boilerpipeR)
    library(RCurl)
    library(tm)

    # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
    url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
    extract <-

Is there an algorithm for determining the relevance of a text to a theme?

一曲冷凌霜 submitted on 2019-12-06 15:02:30
I want to know what can be used to determine the relevance of a page to a theme such as games, movies, etc. Is there research in this area, or is the only option counting how many times relevant words appear?

The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting. Popular algorithms include Naive Bayes and (linear) SVMs. For this approach, you'll need labeled training data, i.e. documents annotated with relevant themes. See, e.g., Introduction to Information Retrieval, chapters 13-15.

Source: https://stackoverflow
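As a concrete illustration of the bag-of-words classification route the answer describes, here is a minimal multinomial Naive Bayes sketch in Python with Laplace smoothing. The toy training documents and theme labels are invented; a real system would train on annotated pages and likely use tf-idf features and an established library.

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (token_list, label) pairs."""
    labels = Counter(lbl for _, lbl in docs)        # documents per label
    counts = {lbl: Counter() for lbl in labels}     # word counts per label
    for tokens, lbl in docs:
        counts[lbl].update(tokens)
    vocab = {w for c in counts.values() for w in c}
    return labels, counts, vocab

def classify(tokens, labels, counts, vocab):
    total_docs = sum(labels.values())
    best, best_lp = None, float("-inf")
    for lbl, ndocs in labels.items():
        lp = log(ndocs / total_docs)                # log prior
        total_words = sum(counts[lbl].values())
        for w in tokens:
            # Laplace smoothing keeps unseen words from zeroing the score.
            lp += log((counts[lbl][w] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

# Invented toy training data labeled with themes:
docs = [(["play", "level", "score"], "games"),
        (["game", "player", "level"], "games"),
        (["actor", "film", "scene"], "movies"),
        (["movie", "director", "film"], "movies")]
labels, counts, vocab = train_nb(docs)
print(classify(["level", "play"], labels, counts, vocab))  # games
```

The classifier picks the theme whose word distribution makes the page's words most probable, which is exactly the step that plain word counting lacks.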

Remove stopwords and tolower function slow on a Corpus in R

无人久伴 submitted on 2019-12-06 15:02:11
I have a corpus with roughly 75 MB of data. I am trying to use the following commands:

    tm_map(doc.corpus, removeWords, stopwords("english"))
    tm_map(doc.corpus, tolower)

These two functions alone take at least 40 minutes to run. I am looking to speed up the process, as I am using the TDM matrix for my model. I have tried commands like gc() and memory.limit(10000000) frequently, but I am not able to speed things up. I have a system with 4 GB of RAM and am running a local database to read the input data. Hoping for suggestions to speed this up!

Maybe you can give quanteda a try:

    library(stringi)

How to divide text (string) by a certain character using R

时光毁灭记忆、已成空白 submitted on 2019-12-06 14:39:33
How do I classify strings using R? My text file has the following structure: >cell_c2< 8/30/2017 This location has been closed for a few months. Recently I passed by and attracted by their street sign Teriyaki Grill Open. I gave a try. The cashier was friendly and recommended me to try their most popular Teriyaki chicken box. It came with mixed vege and steamed rice. They have an open kitchen with SS equipment. I could see the chef make grill after my order was placed. I love the teriyaki chicken with white rice. The full box costs $8 after tax. I think it's pretty reasonable for what you get near the
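Assuming the records in the file each open with a >name< marker like >cell_c2<, the splitting step can be sketched with a regular expression. The sketch below is in Python for brevity (the second record is invented to make the split visible); in R the same pattern can drive strsplit() or regmatches().

```python
import re

# Sample in the spirit of the question's file: each record opens with >name<.
text = """>cell_c2< 8/30/2017 This location has been closed for a few months.
>cell_c3< 9/2/2017 Another short review."""

# re.split with a capturing group keeps the >name< markers in the result,
# so markers and record bodies alternate from index 1 onward.
parts = re.split(r"(>[^<>]+<)", text)
records = {parts[i].strip("><"): parts[i + 1].strip()
           for i in range(1, len(parts) - 1, 2)}

print(sorted(records))        # ['cell_c2', 'cell_c3']
print(records["cell_c2"][:9]) # 8/30/2017
```

Once each record is keyed by its cell name, the review text can be fed into whatever classification step follows.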

Finding unusual phrases using a “bag of usual phrases”

北慕城南 submitted on 2019-12-06 14:26:02
Question: My goal is to input an array of phrases, as in

    array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
             "At vero eos et accusam et justo duo dolores et ea rebum.",
             "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]

and to present a new phrase to it, like "Felix qui potuit rerum cognoscere causas", and I want it to tell me whether this is likely part of
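One way to frame "likely part of the bag of usual phrases" is to compare the new phrase's bag of words against the pooled vocabulary of the known phrases, e.g. by cosine similarity, and flag phrases below some threshold as unusual. A minimal Python sketch (the "usual" phrase strings are abbreviated from the question's array, and any cutoff applied to the score would be an arbitrary choice):

```python
from collections import Counter
from math import sqrt

# Abbreviated versions of the question's "usual" phrases:
usual = [
    "Lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam voluptua",
    "At vero eos et accusam et justo duo dolores et ea rebum",
]

def bag(text):
    return Counter(text.lower().split())

def cosine(a, b):
    if not a or not b:
        return 0.0
    dot = sum(a[w] * b[w] for w in a)
    return dot / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

corpus_bag = bag(" ".join(usual))  # pooled vocabulary of known phrases

def unusualness(phrase):
    """0.0 = wording fully familiar, 1.0 = shares no vocabulary."""
    return 1.0 - cosine(bag(phrase), corpus_bag)

# A phrase built from corpus words scores as more familiar than one that
# shares no vocabulary at all:
print(unusualness("dolores et ea"))
print(unusualness("Felix qui potuit rerum cognoscere causas"))  # 1.0
```

Bag-of-words ignores word order, so for longer inputs comparing n-grams (phrases of 2-3 words) instead of single words would better capture "usual phrases".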