tm

Arabic text not showing in R

旧时模样 submitted on 2019-12-12 16:25:44
Question: I have just started working with R on Arabic text, as I plan to do text analysis and text mining with a Hadith corpus. I have been reading threads related to my question but nevertheless still can't manage to get the real basics here (sorry, absolute beginner). So I entered: textarabic.v <- scan("data/arabic-text.txt", encoding="UTF-8", what="character", sep="\n") And what comes out in textarabic.v is, of course, symbols (pic). Prior to this I saved my text as UTF-8, as I read in a thread, but still nothing
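A minimal sketch of one way to check whether the bytes were actually read correctly, under the assumption that the file really is UTF-8 (a temporary file is written first here so the example is self-contained; substitute your own data/arabic-text.txt):

```r
# Write a small UTF-8 file, then read it back with the encoding declared,
# so R marks the strings as UTF-8 instead of reinterpreting the bytes.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines("\u0645\u0631\u062d\u0628\u0627", con)  # Arabic "marhaban"
close(con)

textarabic.v <- scan(tmp, encoding = "UTF-8", what = "character", sep = "\n")
Encoding(textarabic.v[1])  # "UTF-8" means the data survived the read
```

If the encoding mark is correct but the console still shows escape sequences, the problem is usually the terminal locale or font rendering, not the data itself.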

Error in simple_triplet_matrix — unable to use RWeka to count Phrases

对着背影说爱祢 submitted on 2019-12-12 15:39:41
Question: Using tm, I'm comparing a DocumentTermMatrix against a dictionary list to count totals: totals <- inspect(DocumentTermMatrix(x, list(dictionary = d))) This works great for single words, but I want to include two-word phrases and can't figure out how to do this. I tried RWeka: TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)) tdm <- TermDocumentMatrix(v.corpus, control = list(tokenize = TrigramTokenizer)) But I get the following error message: Error in simple_triplet
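For two-word phrases specifically, here is a sketch that sidesteps the RWeka/Java dependency entirely by building a bigram tokenizer from NLP::ngrams (the small corpus below is illustrative; substitute your own v.corpus):

```r
library(tm)  # tm imports NLP, which provides ngrams()

# Turn a document into "word1 word2" bigram tokens.
BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  vapply(NLP::ngrams(words, 2L), paste, character(1), collapse = " ")
}

v.corpus <- VCorpus(VectorSource(c("the quick brown fox",
                                   "the quick red fox")))
tdm <- TermDocumentMatrix(v.corpus,
                          control = list(tokenize = BigramTokenizer))
Terms(tdm)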

tm.package: findAssocs vs Cosine

Deadly submitted on 2019-12-12 12:09:17
Question: I'm new here, and my question is of a mathematical rather than programming nature; I would like a second opinion on whether my approach makes sense. I was trying to find associations between words in my corpus using the function findAssocs from the tm package. Even though it appears to perform reasonably well on the data available through the package, such as the New York Times and US Congress datasets, I was disappointed with its performance on my own, less tidy dataset. It appears to be
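As a second opinion on the math: findAssocs reports Pearson correlations between term vectors, while cosine similarity skips the mean-centering and works directly on the raw counts, which can behave very differently on sparse, untidy data. A toy sketch of the comparison (the matrix is illustrative: two terms over three documents):

```r
# Rows are terms, columns are documents.
m <- matrix(c(1, 0, 2,
              1, 1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("apple", "pie"), NULL))

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(m["apple", ], m["pie", ])  # no centering; non-negative counts stay non-negative
cor(m["apple", ], m["pie", ])     # Pearson correlation, which is what findAssocs uses
```

On this toy matrix the two measures even disagree in sign, which illustrates why an approach that works on tidy benchmark corpora can look disappointing elsewhere.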

Removing rows from Corpus with multiple documents

烂漫一生 submitted on 2019-12-12 09:52:40
Question: I have 4000 text documents in a corpus. As part of data clean-up, I want to remove the row(s) that contain a specific word from each document. For example: library(tm) doc.corpus <- VCorpus(DirSource("C:\\TextMining\\Prototype", pattern="*.txt", encoding="UTF8", mode="text"), readerControl=list(language="en")) doc.corpus <- tm_map(doc.corpus, PlainTextDocument) doc.corpus[[1]] #PlainTextDocument Metadata: 7 Content: chars: 16542 as.character(doc.corpus)[[1]] $content "Quick to deploy, easy to use,
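One sketch of this clean-up step, using content_transformer() to drop any line that contains a target word from each document (the two-document corpus and the word "confidential" are illustrative stand-ins):

```r
library(tm)

v.corpus <- VCorpus(VectorSource(c("keep this line\ndrop confidential line",
                                   "nothing to remove here")))

# Split each document's content on newlines, keep only the lines that
# do not contain the word, and join the survivors back together.
dropLines <- content_transformer(function(text, word) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  paste(lines[!grepl(word, lines, fixed = TRUE)], collapse = "\n")
})

v.corpus <- tm_map(v.corpus, dropLines, "confidential")
as.character(v.corpus[[1]])  # "keep this line"
```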

Example of tm use

孤人 submitted on 2019-12-12 08:41:33
Question: Can you give an example of the use of struct tm (I don't know how to initialize that struct) where the current date is written in the format y/m/d? Answer 1: How to use the tm structure: call time() to get the current date/time as the number of seconds since 1 Jan 1970; call localtime() to get a struct tm pointer (if you want GMT, call gmtime() instead of localtime()); then use sprintf() or strftime() to convert the struct tm to a string in any format you want. Example: #include <stdio.h> #include <time.h> int main () {

text mining with tm package in R: remove words starting with [http] or any other specific word

一世执手 submitted on 2019-12-12 07:28:44
Question: I am new to R and text mining. I made a word cloud out of a Twitter feed related to some term. The problem I'm facing is that the word cloud shows http:... or htt... How do I deal with this issue? I tried using the metacharacter * but I still doubt I'm applying it right: tw.text = removeWords(tw.text, c(stopwords("en"), "rt", "http\\*")) Somebody into text mining, please help me with this. Answer 1: If you are looking to remove URLs from your string, you may use: gsub("(f|ht)tp(s?)://(.*)[.
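A likely reason the attempt fails is that removeWords() matches whole literal words; the * there is not a wildcard. A sketch that strips URLs with gsub() before the word-cloud step (tw.text below is an illustrative stand-in for the Twitter feed):

```r
library(tm)

tw.text <- c("check this out https://example.com/page rt",
             "no links in this tweet")

# Strip URLs first; the pattern covers http, https, and ftp links
# up to the next whitespace.
tw.text <- gsub("(f|ht)tps?://\\S+", "", tw.text)
tw.text <- removeWords(tw.text, c(stopwords("en"), "rt"))
tw.text
```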

R: compilation failed for package 'slam'

为君一笑 submitted on 2019-12-12 04:29:35
Question: I am using R 2.15.2 and I want to install the tm package to do some text analysis. I have downloaded the compatible tm package (tm_0.5-9) from the CRAN archives. I tried to install it using install.packages("/Downloads/tm_0.5-9.tar.gz", repos = NULL, type="source", dependencies = TRUE) and got the following error: Installing package(s) into ‘/Documents/R/win-library/2.15’ (as ‘lib’ is unspecified) ERROR: dependency 'slam' is not available for package 'tm' * removing '/Documents/R
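The error means the slam dependency has to be present before the archived tm tarball can be installed; installing from a local file with repos = NULL does not pull in dependencies. A sketch (the tarball path is the one from the question; for R 2.15 a correspondingly old slam from the CRAN archive may be needed):

```r
# Install the dependency first, then the archived tm source package.
install.packages("slam")
install.packages("/Downloads/tm_0.5-9.tar.gz",
                 repos = NULL, type = "source")
```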

set encoding for reading text files into tm Corpora

家住魔仙堡 submitted on 2019-12-12 03:45:42
Question: When loading a bunch of documents into a tm Corpus, I need to specify the encoding. All documents are UTF-8 encoded. If opened in a text editor the content is fine, but the corpus contents are full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES). library(tm) cname <- file.path("C:", "Users", "john", "Documents", "texts") docs <- Corpus(DirSource(cname), encoding ="UTF-8") > Error in Corpus(DirSource(cname), encoding = "UTF-8") : unused argument (encoding = "UTF-8") EDITED: Getting str
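The "unused argument" error arises because in current tm the encoding argument belongs to DirSource(), not Corpus(). A self-contained sketch (a temporary directory with one UTF-8 file stands in for the question's cname):

```r
library(tm)

# Stand-in for the question's directory of UTF-8 Spanish text files.
cname <- tempfile()
dir.create(cname)
con <- file(file.path(cname, "doc1.txt"), open = "w", encoding = "UTF-8")
writeLines("indicio de la canci\u00f3n", con)
close(con)

# encoding goes on the source, not on the corpus constructor.
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"),
                readerControl = list(language = "es"))
as.character(docs[[1]])
```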

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘Rcpp’ [duplicate]

三世轮回 submitted on 2019-12-12 03:02:21
Question: This question already has answers here: Error in loadNamespace(name) : there is no package called 'Rcpp' (6 answers). Closed 3 years ago. Basically I want to use the wordcloud function. I'm working through the R console, but I could use RStudio if that's the problem. When I use wordcloud(r_stats_text_corpus) I get: Error: could not find function "wordcloud" I also tried library("wordcloud") and got: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘Rcpp’ Error:
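The message says the Rcpp dependency is missing from the library, not that wordcloud itself is broken; switching to RStudio will not change that. A sketch of the usual fix:

```r
# Reinstall the missing dependency, then the package that needs it.
install.packages("Rcpp")
install.packages("wordcloud")  # rebuilds against the fresh Rcpp if needed
library(wordcloud)
```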

Using DocumentTermMatrix on a Vector of First and Last Names

空扰寡人 submitted on 2019-12-12 02:26:44
Question: I have a column in my data frame (df) as follows: > people = df$people > people[1:3] [1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner" [2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden" [3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer" The column has 4k+ unique first/last/nick names, as a list of full names on each row as shown above. I would like to create a DocumentTermMatrix for this column where full-name matches are found and only the names that
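A sketch of a custom tokenizer that splits on commas, so each full name becomes a single term in the DocumentTermMatrix rather than being broken into first and last names (the two rows below are from the question; tolower is disabled so the names survive intact):

```r
library(tm)

people <- c("Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
            "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden")

# One token per comma-separated full name, with surrounding spaces trimmed.
nameTokenizer <- function(x) trimws(strsplit(as.character(x), ",")[[1]])

dtm <- DocumentTermMatrix(VCorpus(VectorSource(people)),
                          control = list(tokenize = nameTokenizer,
                                         tolower = FALSE,
                                         wordLengths = c(1, Inf)))
Terms(dtm)
```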