tm

Arabic text not showing in R

旧时模样 submitted on 2019-12-12 16:25:44
Question: I have just started working with R on Arabic text, as I plan to do text analysis and text mining with a Hadith corpus. I have been reading threads related to my question but nevertheless still can't manage to get the real basics here (sorry, absolute beginner). So I entered: textarabic.v <- scan("data/arabic-text.txt", encoding="UTF-8", what="character", sep="\n") And what comes out in textarabic.v is, of course, symbols (pic). Prior to this I saved my text as UTF-8, as I read in a thread, but still nothing
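A minimal sketch of one way to check whether the bytes were actually read correctly, under the assumption that the file really is UTF-8 (a temporary file is written first here so the example is self-contained; substitute your own data/arabic-text.txt):

```r
# Write a small UTF-8 file, then read it back with the encoding declared,
# so R marks the strings as UTF-8 instead of reinterpreting the bytes.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-8")
writeLines("\u0645\u0631\u062d\u0628\u0627", con)  # Arabic "marhaban"
close(con)

textarabic.v <- scan(tmp, encoding = "UTF-8", what = "character", sep = "\n")
Encoding(textarabic.v[1])  # "UTF-8" means the data survived the read
```

If the encoding mark is correct but the console still shows escape sequences, the problem is usually the terminal locale or font rendering, not the data itself.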

Error in simple_triplet_matrix — unable to use RWeka to count Phrases

对着背影说爱祢 submitted on 2019-12-12 15:39:41
Question: Using tm, I'm comparing a DocumentTermMatrix against a dictionary list to count totals: totals <- inspect(DocumentTermMatrix(x, list(dictionary = d))) This works great for single words, but I want to include two-word phrases and can't figure out how to do this. I tried RWeka: TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3)) tdm <- TermDocumentMatrix(v.corpus, control = list(tokenize = TrigramTokenizer)) But I get the following error message: Error in simple_triplet
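For two-word phrases specifically, here is a sketch that sidesteps the RWeka/Java dependency entirely by building a bigram tokenizer from NLP::ngrams (the small corpus below is illustrative; substitute your own v.corpus):

```r
library(tm)  # tm imports NLP, which provides ngrams()

# Turn a document into "word1 word2" bigram tokens.
BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  vapply(NLP::ngrams(words, 2L), paste, character(1), collapse = " ")
}

v.corpus <- VCorpus(VectorSource(c("the quick brown fox",
                                   "the quick red fox")))
tdm <- TermDocumentMatrix(v.corpus,
                          control = list(tokenize = BigramTokenizer))
Terms(tdm)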

tm.package: findAssocs vs Cosine

Deadly submitted on 2019-12-12 12:09:17
Question: I'm new here, and my question is of a mathematical rather than programming nature; I would like a second opinion on whether my approach makes sense. I was trying to find associations between words in my corpus using the function findAssocs from the tm package. Even though it appears to perform reasonably well on the data available through the package, such as the New York Times and US Congress datasets, I was disappointed with its performance on my own, less tidy dataset. It appears to be
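As a second opinion on the math: findAssocs reports Pearson correlations between term vectors, while cosine similarity skips the mean-centering and works directly on the raw counts, which can behave very differently on sparse, untidy data. A toy sketch of the comparison (the matrix is illustrative: two terms over three documents):

```r
# Rows are terms, columns are documents.
m <- matrix(c(1, 0, 2,
              1, 1, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("apple", "pie"), NULL))

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(m["apple", ], m["pie", ])  # no centering; non-negative counts stay non-negative
cor(m["apple", ], m["pie", ])     # Pearson correlation, which is what findAssocs uses
```

On this toy matrix the two measures even disagree in sign, which illustrates why an approach that works on tidy benchmark corpora can look disappointing elsewhere.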

Removing rows from Corpus with multiple documents

烂漫一生 submitted on 2019-12-12 09:52:40
Question: I have 4000 text documents in a corpus. As part of data clean-up, I want to remove the row(s) that contain a specific word from each document. For example: library(tm) doc.corpus <- VCorpus(DirSource("C:\\TextMining\\Prototype", pattern="*.txt", encoding="UTF8", mode="text"), readerControl=list(language="en")) doc.corpus <- tm_map(doc.corpus, PlainTextDocument) doc.corpus[[1]] #PlainTextDocument Metadata: 7 Content: chars: 16542 as.character(doc.corpus)[[1]] $content "Quick to deploy, easy to use,
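One sketch of this clean-up step, using content_transformer() to drop any line that contains a target word from each document (the two-document corpus and the word "confidential" are illustrative stand-ins):

```r
library(tm)

v.corpus <- VCorpus(VectorSource(c("keep this line\ndrop confidential line",
                                   "nothing to remove here")))

# Split each document's content on newlines, keep only the lines that
# do not contain the word, and join the survivors back together.
dropLines <- content_transformer(function(text, word) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  paste(lines[!grepl(word, lines, fixed = TRUE)], collapse = "\n")
})

v.corpus <- tm_map(v.corpus, dropLines, "confidential")
as.character(v.corpus[[1]])  # "keep this line"
```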

Example of tm use

孤人 submitted on 2019-12-12 08:41:33
Question: Can you give an example of the use of struct tm (I don't know how to initialize that struct) where the current date is written in the format y/m/d? Answer 1: How to use the tm structure: call time() to get the current date/time as the number of seconds since 1 Jan 1970; call localtime() to get a struct tm pointer (if you want GMT, call gmtime() instead of localtime()); then use sprintf() or strftime() to convert the struct tm to a string in any format you want. Example: #include <stdio.h> #include <time.h> int main () {

text mining with tm package in R: remove words starting with [http] or any other specific word

一世执手 submitted on 2019-12-12 07:28:44
Question: I am new to R and text mining. I made a word cloud out of a Twitter feed related to some term. The problem I'm facing is that the word cloud shows http:... or htt... How do I deal with this issue? I tried using the metacharacter * but I still doubt I'm applying it right: tw.text = removeWords(tw.text, c(stopwords("en"), "rt", "http\\*")) Somebody into text mining, please help me with this. Answer 1: If you are looking to remove URLs from your string, you may use: gsub("(f|ht)tp(s?)://(.*)[.
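A likely reason the attempt fails is that removeWords() matches whole literal words; the * there is not a wildcard. A sketch that strips URLs with gsub() before the word-cloud step (tw.text below is an illustrative stand-in for the Twitter feed):

```r
library(tm)

tw.text <- c("check this out https://example.com/page rt",
             "no links in this tweet")

# Strip URLs first; the pattern covers http, https, and ftp links
# up to the next whitespace.
tw.text <- gsub("(f|ht)tps?://\\S+", "", tw.text)
tw.text <- removeWords(tw.text, c(stopwords("en"), "rt"))
tw.text
```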

R: compilation failed for package 'slam'

为君一笑 submitted on 2019-12-12 04:29:35
Question: I am using R 2.15.2 and I want to install the tm package to do some text analysis. I have downloaded the compatible tm package (tm_0.5-9) from the CRAN archives. I tried to install it using install.packages("/Downloads/tm_0.5-9.tar.gz", repos = NULL, type="source", dependencies = TRUE) and got the following error: Installing package(s) into ‘/Documents/R/win-library/2.15’ (as ‘lib’ is unspecified) ERROR: dependency 'slam' is not available for package 'tm' * removing '/Documents/R
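The error means the slam dependency has to be present before the archived tm tarball can be installed; installing from a local file with repos = NULL does not pull in dependencies. A sketch (the tarball path is the one from the question; for R 2.15 a correspondingly old slam from the CRAN archive may be needed):

```r
# Install the dependency first, then the archived tm source package.
install.packages("slam")
install.packages("/Downloads/tm_0.5-9.tar.gz",
                 repos = NULL, type = "source")
```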

set encoding for reading text files into tm Corpora

家住魔仙堡 submitted on 2019-12-12 03:45:42
Question: When loading a bunch of documents into a tm Corpus, I need to specify the encoding. All documents are UTF-8 encoded. If opened in a text editor the content is fine, but the corpus contents are full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES). library(tm) cname <- file.path("C:", "Users", "john", "Documents", "texts") docs <- Corpus(DirSource(cname), encoding ="UTF-8") > Error in Corpus(DirSource(cname), encoding = "UTF-8") : unused argument (encoding = "UTF-8") EDITED: Getting str
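The "unused argument" error arises because in current tm the encoding argument belongs to DirSource(), not Corpus(). A self-contained sketch (a temporary directory with one UTF-8 file stands in for the question's cname):

```r
library(tm)

# Stand-in for the question's directory of UTF-8 Spanish text files.
cname <- tempfile()
dir.create(cname)
con <- file(file.path(cname, "doc1.txt"), open = "w", encoding = "UTF-8")
writeLines("indicio de la canci\u00f3n", con)
close(con)

# encoding goes on the source, not on the corpus constructor.
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"),
                readerControl = list(language = "es"))
as.character(docs[[1]])
```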

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘Rcpp’ [duplicate]

三世轮回 submitted on 2019-12-12 03:02:21
Question: This question already has answers here: Error in loadNamespace(name) : there is no package called 'Rcpp' (6 answers). Closed 3 years ago. Basically I want to use the wordcloud function. I'm working through the R console, but I could use RStudio if that's the problem. When I use wordcloud(r_stats_text_corpus) I get: Error: could not find function "wordcloud" I also tried library("wordcloud") and got: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘Rcpp’ Error:
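The message says the Rcpp dependency is missing from the library, not that wordcloud itself is broken; switching to RStudio will not change that. A sketch of the usual fix:

```r
# Reinstall the missing dependency, then the package that needs it.
install.packages("Rcpp")
install.packages("wordcloud")  # rebuilds against the fresh Rcpp if needed
library(wordcloud)
```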

Using DocumentTermMatrix on a Vector of First and Last Names

空扰寡人 submitted on 2019-12-12 02:26:44
Question: I have a column in my data frame (df) as follows: > people = df$people > people[1:3] [1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner" [2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden" [3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer" The column has 4k+ unique first/last/nick names, as a list of full names on each row as shown above. I would like to create a DocumentTermMatrix for this column where full-name matches are found and only the names that
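A sketch of a custom tokenizer that splits on commas, so each full name becomes a single term in the DocumentTermMatrix rather than being broken into first and last names (the two rows below are from the question; tolower is disabled so the names survive intact):

```r
library(tm)

people <- c("Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
            "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden")

# One token per comma-separated full name, with surrounding spaces trimmed.
nameTokenizer <- function(x) trimws(strsplit(as.character(x), ",")[[1]])

dtm <- DocumentTermMatrix(VCorpus(VectorSource(people)),
                          control = list(tokenize = nameTokenizer,
                                         tolower = FALSE,
                                         wordLengths = c(1, Inf)))
Terms(dtm)
```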