tm

How to search for specific n-grams in a corpus using R

Submitted by 独自空忆成欢 on 2019-12-31 03:57:08

Question: I'm looking for specific n-grams in a corpus. Let's say I want to find 'asset management' and 'historical yield' in a collection of documents. This is how I loaded the corpus:

my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"), readerControl = list(reader = readPDF))

I cleaned the corpus and did some basic calculations using document-term matrices. Now I want to look for particular expressions and put them in a data frame. This is what I use (thanks to phiver): ngrams <- c('asset …
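A minimal sketch of one way to do this, assuming the corpus object my_corpus and the two phrases from the question; it matches the raw document text directly rather than building a bigram document-term matrix:

library(tm)

ngrams <- c("asset management", "historical yield")   # phrases of interest

# For each document, lowercase the text and count literal occurrences of each phrase.
counts <- t(sapply(seq_along(my_corpus), function(i) {
  txt <- tolower(paste(content(my_corpus[[i]]), collapse = " "))
  sapply(ngrams, function(ng) {
    hits <- gregexpr(ng, txt, fixed = TRUE)[[1]]
    if (hits[1] == -1) 0L else length(hits)
  })
}))

doc_ids  <- sapply(seq_along(my_corpus), function(i) meta(my_corpus[[i]], "id"))
ngram_df <- data.frame(doc = doc_ids, counts, row.names = NULL)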

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Submitted by 邮差的信 on 2019-12-30 07:51:07

Question: Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the corpus is large, I'm using parallel processing with the doParallel package. Sometimes the transformations do the task, but sometimes they do not. For example, tm::removeNumbers(). The very first document in the corpus has a content value of "n417". So if …
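One way to sidestep this kind of inconsistency is to parallelise over whole chunks of documents rather than letting the transformation dispatch element-wise. A minimal sketch, assuming a VCorpus named corpus (the chunking and cluster setup are illustrative, not from the question):

library(tm)
library(parallel)

n_cores <- max(1, detectCores() - 1)
idx     <- seq_along(corpus)
chunks  <- split(idx, cut(idx, n_cores, labels = FALSE))   # one chunk of documents per core

cl <- makeCluster(n_cores)
clusterEvalQ(cl, library(tm))

# Each worker cleans a complete, self-contained sub-corpus.
cleaned <- parLapply(cl, chunks, function(i, corp) {
  tm_map(corp[i], content_transformer(removeNumbers))
}, corp = corpus)

stopCluster(cl)
corpus_clean <- do.call(c, cleaned)   # recombine the chunks into one corpus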

R text mining documents from CSV file (one row per doc)

Submitted by 给你一囗甜甜゛ on 2019-12-29 03:33:14

Question: I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus, but I want each line to be a different document within the corpus, so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set. Originally I did the following:

fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t") …
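A minimal sketch of the usual approach (the file name and the column name feedback are assumptions): VectorSource() treats every element of a character vector, here every CSV row, as its own document, so no separator argument is needed.

library(tm)

fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

fdbk_corpus <- VCorpus(VectorSource(fdbk$feedback))   # one document per row
fdbk_corpus                                           # should report ~10,000 documents
dtm <- DocumentTermMatrix(fdbk_corpus)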

R-Project no applicable method for 'meta' applied to an object of class “character”

Submitted by 可紊 on 2019-12-27 11:47:01

Question: I am trying to run this code (Ubuntu 12.04, R 3.1.1):

# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)

# Place Enron email snippets into a single vector.
text <- c(
  "To Mr. Ken Lay, I'm writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
  "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans …
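A minimal sketch of the usual fix: from tm 0.6 onwards, base functions such as tolower have to be wrapped in content_transformer() inside tm_map(); passing them bare silently turns the documents into plain character vectors, and later calls then fail with "no applicable method for 'meta' applied to an object of class 'character'".

library(tm)

corpus <- VCorpus(VectorSource(text))                    # `text` is the snippet vector above

corpus <- tm_map(corpus, content_transformer(tolower))   # not tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)              # tm's own transformations need no wrapper
dtm    <- DocumentTermMatrix(corpus)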

Arrange the words of the Document Term Matrix by frequency in R

Submitted by 让人想犯罪 __ on 2019-12-25 08:07:14

Question: Sorry for the newbie question, but I'm new to text mining and need some expert advice. After a long struggle with content_transformer I now have a clean corpus. The next question: 1. How do I select from the `dtm` the words with small frequencies, so that their combined share is no more than 1% of all words? For example, I need this format: x = 0.5% of all words in the dataset, y = 0.2%, z = 0.3%, so the total frequency sum here is 1%. How do I do this?

Answer 1: You can take a look at the TermDocumentMatrix function of the tm package …
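A minimal sketch along those lines, assuming a DocumentTermMatrix called dtm: compute each term's share of all tokens, then keep the rarest terms for as long as their cumulative share stays at or below 1%.

library(tm)

freq     <- colSums(as.matrix(dtm))   # raw frequency of every term
rel_freq <- sort(freq / sum(freq))    # share of all words, rarest terms first

small <- rel_freq[cumsum(rel_freq) <= 0.01]   # cumulative share capped at 1%

data.frame(term = names(small), share_pct = round(100 * small, 2))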

How do I extract contents from a koRpus object in R?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-24 21:13:40

Question: I'm using the tm package and looking to get the Flesch-Kincaid scores for a document using R. I found that the koRpus package has a lot of metrics, including reading level, and started using that. However, the object returned seems to be a very complicated S4 object that I don't understand how to parse. So, I apply this to my corpus:

txt <- system.file("texts", "txt", package = "tm")
(d <- Corpus(DirSource(txt, encoding = "UTF-8"), readerControl = list(language = "lat")))
f <- function(x) tokenize …
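A minimal sketch of pulling the score out of the S4 result (the slot names below match the koRpus versions I have seen and may differ in yours, so check with slotNames(); the file name is hypothetical):

library(koRpus)
library(koRpus.lang.en)     # recent koRpus releases need a separate language package

tagged <- tokenize("document.txt", lang = "en")
fk     <- flesch.kincaid(tagged)

slotNames(fk)               # inspect what the S4 object actually contains
fk@Flesch.Kincaid$grade     # Flesch-Kincaid grade level as a plain number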

remove emoticons in R using tm package

Submitted by 元气小坏坏 on 2019-12-24 11:55:26

Question: I'm using the tm package to clean up a Twitter corpus. However, the package is unable to clean up emoticons. Here's a replicated code:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))

Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm …
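A minimal sketch of a common workaround (not a tm feature as such): convert the text to ASCII with iconv(), substituting nothing for any character that cannot be represented, which strips emoji and other emoticon characters before the remaining tm_map() steps.

library(tm)

remove_nonascii <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))

July4th_clean <- tm_map(July4th_clean, remove_nonascii)
July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))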

“The process has forked…” Error while using tm package in R

Submitted by …衆ロ難τιáo~ on 2019-12-24 06:43:29

Question: I installed the tm package in R to do some text mining analysis. After creating a corpus I wanted to use the tm_map() function, which throws the following error message:

The process has forked and you cannot use this CoreFoundation functionality safely. You MUST exec(). Break on …

Does anybody have an idea why this message turns up? Here's more code for clarification:

> require(tm)
Loading required package: tm
> a <- Corpus(VectorSource(chi2014_df$text))
> a
A corpus with 70 text documents
> a <- tm_map(a, tolower …
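A minimal sketch of the workaround usually suggested for this macOS-specific message: in the affected tm versions, tm_map() parallelises with mclapply() under the hood, and the forked workers trip over CoreFoundation; passing mc.cores = 1 keeps everything in one process (on newer tm you would also wrap tolower in content_transformer()).

require(tm)

a <- Corpus(VectorSource(chi2014_df$text))
a <- tm_map(a, tolower, mc.cores = 1)   # run serially to avoid the fork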