text-mining

How to find term frequency within a DTM in R?

我的未来我决定 submitted on 2019-12-07 12:44:28
Question: I've been using the tm package to create a Document-Term Matrix as follows:

    library(tm)
    library(RWeka)
    library(SnowballC)
    src <- DataframeSource(data.frame(data3$JobTitle))
    # Create a corpus and transform the data
    # Set the default number of threads to use
    options(mc.cores = 1)
    c_copy <- c <- Corpus(src)
    c <- tm_map(c, content_transformer(tolower), mc.cores = 1)
    c <- tm_map(c, content_transformer(removeNumbers), mc.cores = 1)
    c <- tm_map(c, removeWords, stopwords("english"), mc.cores = 1)
    c <- tm_map(c

Print first line of one element of Corpus in R using tm package

半世苍凉 submitted on 2019-12-07 09:56:28
How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure, so printing just the first line, or first few lines, of the corpus would be ideal.

    # Load libraries
    library(tm)
    # Read in the corpus
    corp <- SimpleCorpus(DirSource("C:/TextDocument"))
    # Remove punctuation
    corp <- removePunctuation(corp,
                              preserve_intra_word_contractions = TRUE,
                              preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways:

    # Print first line of first element of

Explicit Semantic Analysis

若如初见. submitted on 2019-12-07 05:52:12
Question: I came across the term 'Explicit Semantic Analysis', which uses Wikipedia as a reference, finds the similarity between documents, and categorizes them into classes (correct me if I am wrong). The link I came across is here. I wanted to learn more about it. Please help me out with it!

Answer 1: Explicit semantic analysis works along similar lines to semantic similarity. I got hold of this link, which provides a clear example of ESA.

Source: https://stackoverflow.com/questions/8707624/explicit
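The core idea of ESA can be sketched without Wikipedia itself: each text is mapped to a weighted vector over a space of concepts (in real ESA, one dimension per Wikipedia article), and two texts are compared by the cosine of their concept vectors. Below is a minimal Python sketch; the concept names and word weights are invented stand-ins for the tf-idf scores a real ESA system derives from Wikipedia.

```python
from math import sqrt

# Toy "concept space": each concept stands in for a Wikipedia article and
# maps to descriptive words with tf-idf-like weights. All names and weights
# here are invented for illustration only.
CONCEPTS = {
    "Computer":  {"computer": 1.0, "software": 0.8, "machine": 0.5},
    "Biology":   {"cell": 1.0, "organism": 0.9, "machine": 0.1},
    "Economics": {"market": 1.0, "trade": 0.8, "software": 0.2},
}

def esa_vector(words):
    """Map a bag of words to one weight per concept."""
    return {c: sum(w.get(word, 0.0) for word in words)
            for c, w in CONCEPTS.items()}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def esa_similarity(words_a, words_b):
    """Similarity of two texts = cosine of their concept vectors."""
    return cosine(esa_vector(words_a), esa_vector(words_b))

# "computer" and "software" activate the same concept, "cell" does not:
print(esa_similarity(["computer"], ["software"]))  # high (~0.97)
print(esa_similarity(["computer"], ["cell"]))      # 0.0
```

This is why ESA can relate documents that share no literal words: both only have to activate overlapping concepts.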

converting stemmed word to the root word in R

坚强是说给别人听的谎言 submitted on 2019-12-06 20:41:30
Hi, I have a list of words which have been stemmed using the "tm" package in R. Can I get back the root word somehow after this step? Thanks in advance. Example: activiti --> activity

You can use the stemCompletion() function to achieve this, but you may need to trim the stems first. Consider the following:

    library(tm)
    library(qdap)  # provides the stemmer() function
    active.text = "there are plenty of funny activities"
    active.corp = Corpus(VectorSource(active.text))
    (st.text = tolower(stemmer(active.text, warn = F)))
    # This is what the columns of your Term-Document Matrix are going to look like
    [1]

Calculate similarity between list of words

走远了吗. submitted on 2019-12-06 16:24:08
Question: I want to calculate the similarity between two lists of words. For example:

    ['email','user','this','email','address','customer']

is similar to this list:

    ['email','mail','address','netmail']

I want it to have a higher percentage of similarity than another list, for example:

    ['address','ip','network']

even though 'address' exists in that list.

Answer 1: Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot:

    list_A = ['email','user','this','email','address','customer']
    list
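A simple, thesaurus-free baseline for this is Jaccard similarity on the word sets, which already gives the ordering the question asks for on these example lists. (Counting 'mail' or 'netmail' as matches for 'email' would additionally require synonym expansion, e.g. via WordNet.) A minimal Python sketch:

```python
def jaccard_similarity(list_a, list_b):
    """Set-overlap similarity: |A intersect B| / |A union B|; duplicates ignored."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if a | b else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(jaccard_similarity(list_A, list_B))  # 2/7 ~ 0.286
print(jaccard_similarity(list_A, list_C))  # 1/7 ~ 0.143
```

list_B shares two words with list_A ('email', 'address') out of seven distinct words overall, while list_C shares only 'address', so list_B scores higher.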

Why isn't stemDocument stemming?

走远了吗. submitted on 2019-12-06 16:19:40
I am using the 'tm' package in R to create a term-document matrix using stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it. Here is the script for the process, which uses a couple of online news stories as the sandbox:

    library(boilerpipeR)
    library(RCurl)
    library(tm)

    # Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
    url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
    extract <-

Is there an algorithm for determining the relevance of a text to a theme?

一曲冷凌霜 submitted on 2019-12-06 15:02:30
I want to know what can be used to determine the relevance of a page to a theme such as games, movies, etc. Is there research in this area, or is the only option counting how many times relevant words appear?

The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting. Popular algorithms include Naive Bayes and (linear) SVMs. For this approach, you'll need labeled training data, i.e. documents annotated with relevant themes. See, e.g., Introduction to Information Retrieval, chapters 13-15.

Source: https://stackoverflow
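As a concrete illustration of the bag-of-words classification route the answer describes, here is a minimal multinomial Naive Bayes sketch in Python with Laplace smoothing. The toy training documents and theme labels are invented; a real system would train on annotated pages and likely use tf-idf features and an established library.

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (token_list, label) pairs."""
    labels = Counter(lbl for _, lbl in docs)        # documents per label
    counts = {lbl: Counter() for lbl in labels}     # word counts per label
    for tokens, lbl in docs:
        counts[lbl].update(tokens)
    vocab = {w for c in counts.values() for w in c}
    return labels, counts, vocab

def classify(tokens, labels, counts, vocab):
    total_docs = sum(labels.values())
    best, best_lp = None, float("-inf")
    for lbl, ndocs in labels.items():
        lp = log(ndocs / total_docs)                # log prior
        total_words = sum(counts[lbl].values())
        for w in tokens:
            # Laplace smoothing keeps unseen words from zeroing the score.
            lp += log((counts[lbl][w] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

# Invented toy training data labeled with themes:
docs = [(["play", "level", "score"], "games"),
        (["game", "player", "level"], "games"),
        (["actor", "film", "scene"], "movies"),
        (["movie", "director", "film"], "movies")]
labels, counts, vocab = train_nb(docs)
print(classify(["level", "play"], labels, counts, vocab))  # games
```

The classifier picks the theme whose word distribution makes the page's words most probable, which is exactly the step that plain word counting lacks.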

Remove stopwords and tolower function slow on a Corpus in R

无人久伴 submitted on 2019-12-06 15:02:11
I have a corpus with roughly 75 MB of data. I am trying to use the following commands:

    tm_map(doc.corpus, removeWords, stopwords("english"))
    tm_map(doc.corpus, tolower)

These two functions alone take at least 40 minutes to run. I am looking to speed up the process, as I am using the TDM matrix for my model. I have tried commands like gc() and memory.limit(10000000) frequently, but I am not able to speed things up. I have a system with 4 GB of RAM and am running a local database to read the input data. Hoping for suggestions to speed this up!

Maybe you can give quanteda a try:

    library(stringi)

How to divide text (string) by a certain character using R

时光毁灭记忆、已成空白 submitted on 2019-12-06 14:39:33
How do I classify strings using R? My text file has the following structure: >cell_c2< 8/30/2017 This location has been closed for a few months. Recently I passed by and attracted by their street sign Teriyaki Grill Open. I gave a try. The cashier was friendly and recommended me to try their most popular Teriyaki chicken box. It came with mixed vege and steamed rice. They have an open kitchen with SS equipment. I could see the chef make grill after my order was placed. I love the teriyaki chicken with white rice. The full box costs $8 after tax. I think it's pretty reasonable for what you get near the
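Assuming the records in the file each open with a >name< marker like >cell_c2<, the splitting step can be sketched with a regular expression. The sketch below is in Python for brevity (the second record is invented to make the split visible); in R the same pattern can drive strsplit() or regmatches().

```python
import re

# Sample in the spirit of the question's file: each record opens with >name<.
text = """>cell_c2< 8/30/2017 This location has been closed for a few months.
>cell_c3< 9/2/2017 Another short review."""

# re.split with a capturing group keeps the >name< markers in the result,
# so markers and record bodies alternate from index 1 onward.
parts = re.split(r"(>[^<>]+<)", text)
records = {parts[i].strip("><"): parts[i + 1].strip()
           for i in range(1, len(parts) - 1, 2)}

print(sorted(records))        # ['cell_c2', 'cell_c3']
print(records["cell_c2"][:9]) # 8/30/2017
```

Once each record is keyed by its cell name, the review text can be fed into whatever classification step follows.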

Finding unusual phrases using a “bag of usual phrases”

北慕城南 submitted on 2019-12-06 14:26:02
Question: My goal is to input an array of phrases, as in

    array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
             "At vero eos et accusam et justo duo dolores et ea rebum.",
             "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]

and to present a new phrase to it, like "Felix qui potuit rerum cognoscere causas", and I want it to tell me whether this is likely part of
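One way to frame "likely part of the bag of usual phrases" is to compare the new phrase's bag of words against the pooled vocabulary of the known phrases, e.g. by cosine similarity, and flag phrases below some threshold as unusual. A minimal Python sketch (the "usual" phrase strings are abbreviated from the question's array, and any cutoff applied to the score would be an arbitrary choice):

```python
from collections import Counter
from math import sqrt

# Abbreviated versions of the question's "usual" phrases:
usual = [
    "Lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam voluptua",
    "At vero eos et accusam et justo duo dolores et ea rebum",
]

def bag(text):
    return Counter(text.lower().split())

def cosine(a, b):
    if not a or not b:
        return 0.0
    dot = sum(a[w] * b[w] for w in a)
    return dot / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

corpus_bag = bag(" ".join(usual))  # pooled vocabulary of known phrases

def unusualness(phrase):
    """0.0 = wording fully familiar, 1.0 = shares no vocabulary."""
    return 1.0 - cosine(bag(phrase), corpus_bag)

# A phrase built from corpus words scores as more familiar than one that
# shares no vocabulary at all:
print(unusualness("dolores et ea"))
print(unusualness("Felix qui potuit rerum cognoscere causas"))  # 1.0
```

Bag-of-words ignores word order, so for longer inputs comparing n-grams (phrases of 2-3 words) instead of single words would better capture "usual phrases".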