tm

tm readPDF: Error in file(con, "r") : cannot open the connection

冷暖自知 submitted on 2019-11-29 17:14:00
I have tried the example code recommended in the tm::readPDF documentation:

library(tm)
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
  pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = uri), language = "en", id = "id1")
  pdf[1:13]
}

But I get the following error (which occurs after calling the function returned by readPDF):

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : cannot open file 'C:\DOCUME~1\Tomas\LOCALS~1\Temp\RtmpU33iWo\pdfinfo31c2bd5762a'

Adding custom stopwords in R tm

白昼怎懂夜的黑 submitted on 2019-11-29 17:10:48
Question: I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

Answer 1: stopwords just provides you with a vector of words; simply combine your own with it:

tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Answer 2: Save your custom stop words in a csv file (e.g. word.csv).

library(tm)
stopwords <- read.csv("word.csv", header =
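For reference, a minimal sketch of how the two approaches above fit together (the CSV-based answer is cut off): read one custom stop word per line from a hypothetical word.csv, combine it with the built-in English list, and pass the result to removeWords. The corpus name abs is taken from the question.

library(tm)

custom <- read.csv("word.csv", header = FALSE, stringsAsFactors = FALSE)[[1]]  # one stop word per line
all_stop <- c(stopwords("english"), custom)
abs <- tm_map(abs, removeWords, all_stop)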

stemDocument in tm package not working on past-tense words

て烟熏妆下的殇ゞ submitted on 2019-11-29 16:37:37
I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried stemDocument from the tm package, as below, but only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument,
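One possible workaround, sketched here rather than taken from the thread: stemDocument applies the Porter stemmer, which only strips suffixes and cannot map irregular forms such as "said" or "made" back to their base verbs; a lemmatizer can. Using the textstem package is my own assumption, not part of the question.

library(tm)
library(textstem)

lemmatize_strings("said say says make made")
# expected to return "say say say make make"

# inside a tm pipeline, wrap the lemmatizer with content_transformer()
text_corpus <- tm_map(text_corpus, content_transformer(lemmatize_strings))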

How Do I Parse a Date Time String That Includes Fractional Time?

做~自己de王妃 submitted on 2019-11-29 15:14:17
I have a date-time string: 20:48:01.469 UTC MAR 31 2016

I would like to convert this string representation of time to a struct tm using strptime, but my format string isn't working. Is there a format specifier for fractional seconds? Perhaps %S, %s, or something else? Code snippet is below:

tm tmbuf;
const char *str = "20:48:01.469 UTC MAR 31 2016";
const char *fmt = "%H:%M:%s %Z %b %d %Y";
strptime(str, fmt, &tmbuf);

Using this free, open-source C++11/14 library, here is another way to deal with parsing fractional seconds:

#include "tz.h"
#include <iostream>
#include <sstream>
int main() {

Removing overly common words (occur in more than 80% of the documents) in R

南楼画角 submitted on 2019-11-29 14:55:40
Question: I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining step is to remove overly common words (terms that occur in more than 80% of the documents). Can anybody help me with this?

dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords,
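A minimal sketch of one way to do this (not from the original thread): build a DocumentTermMatrix, compute each term's document frequency, and remove the terms that appear in more than 80% of documents. The corpus name dsc is taken from the question; for very large matrices, sparse column sums (e.g. via the slam package) would avoid the dense conversion used here.

library(tm)

dtm <- DocumentTermMatrix(dsc)
doc_freq <- colSums(as.matrix(dtm) > 0)              # number of documents containing each term
too_common <- names(doc_freq)[doc_freq / nDocs(dtm) > 0.8]

dsc <- tm_map(dsc, removeWords, too_common)          # drop them from the corpus
# or drop the columns from the matrix directly:
dtm_trimmed <- dtm[, doc_freq / nDocs(dtm) <= 0.8]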

R DocumentTermMatrix control list not working, silently ignores unknown parameters

别说谁变了你拦得住时间么 submitted on 2019-11-29 14:14:02
Question: I have the two following DTMs:

dtm <- DocumentTermMatrix(t)
dtmImproved <- DocumentTermMatrix(t, control=list(minWordLength = 4, minDocFreq=5))

When I run this, the two DTMs are identical, and if I inspect dtmImproved there are still words with only 3 characters. Why doesn't the minWordLength parameter work? Thank you!

> dtm
A document-term matrix (591 documents, 10533 terms)
Non-/sparse entries: 43058/6181945
Sparsity           : 99%
Maximal term length: 135
Weighting          : term frequency (tf)
> dtmImproved
A document
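In current tm versions the relevant control options are named wordLengths and bounds; unrecognised names such as minWordLength and minDocFreq are silently ignored, which is why the two matrices come out identical. A sketch, assuming the corpus t from the question:

library(tm)

dtmImproved <- DocumentTermMatrix(
  t,
  control = list(
    wordLengths = c(4, Inf),              # keep terms with at least 4 characters
    bounds = list(global = c(5, Inf))     # keep terms occurring in at least 5 documents
  )
)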

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

孤人 submitted on 2019-11-29 09:40:12
Question: Following the many guides to creating bigrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus' but not using 'Corpus'. BTW, I'm pretty sure this was working with 'Corpus' about a month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest. I would
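A minimal sketch of the pattern the question describes as working, with hypothetical toy documents: build a VCorpus and pass an RWeka bigram tokenizer through the TermDocumentMatrix control list.

library(tm)
library(RWeka)

docs <- VCorpus(VectorSource(c("the quick brown fox", "the lazy brown dog")))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
inspect(tdm)   # rows should now be bigrams such as "brown fox"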

Finding ngrams in R and comparing ngrams across corpora

杀马特。学长 韩版系。学妹 submitted on 2019-11-29 07:58:41
Question: I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-step question: one regarding my code so far and one regarding how I should go on. Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early
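For the comparison step (step 2), a rough sketch under the assumption that a bigram TermDocumentMatrix has already been built for each corpus (for example with the tokenizer shown in the previous entry); tdm_propaganda and tdm_reference are hypothetical names, not from the question.

library(tm)

freq_a <- rowSums(as.matrix(tdm_propaganda))   # total bigram counts per corpus
freq_b <- rowSums(as.matrix(tdm_reference))
rel_a <- freq_a / sum(freq_a)                  # relative frequencies
rel_b <- freq_b / sum(freq_b)

# bigrams unique to the propaganda corpus, most frequent first
only_a <- setdiff(names(rel_a), names(rel_b))
head(sort(rel_a[only_a], decreasing = TRUE), 20)

# among shared bigrams, the strongest over-representation in the propaganda corpus
shared <- intersect(names(rel_a), names(rel_b))
head(sort(rel_a[shared] / rel_b[shared], decreasing = TRUE), 20)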

Error trying to read a PDF using readPDF from the tm package

我怕爱的太早我们不能终老 submitted on 2019-11-29 07:27:36
(Windows 7 / R version 3.0.1) Below are the commands and the resulting error:

> library(tm)
> pdf <- readPDF(PdftotextOptions = "-layout")
> dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") : cannot open file 'C:\Users\Raffael\AppData\Local\Temp\RtmpS8Uql1\pdfinfo167c2bc159f8': No such file or directory

How do I solve this issue?

EDIT I (as suggested by Ben and described here): I downloaded Xpdf, copied the 32-bit version to C:\Program Files (x86)\xpdf32 and the 64-bit version to C
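One way to finish the fix sketched in the edit, assuming the directory below is where pdftotext.exe and pdfinfo.exe actually ended up: append it to PATH from within R so that readPDF, which shells out to both tools, can find them.

Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        "C:\\Program Files (x86)\\xpdf32",
                        sep = ";"))
Sys.which(c("pdfinfo", "pdftotext"))   # both should now resolve to non-empty paths

library(tm)
pdf <- readPDF(PdftotextOptions = "-layout")
dat <- pdf(elem = list(uri = "17214.pdf"), language = "de", id = "id1")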

R Corpus Is Messing Up My UTF-8 Encoded Text

回眸只為那壹抹淺笑 submitted on 2019-11-29 05:12:05
I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings correctly. Here is a reproducible example of my problem.

Load in the Russian text:

> data <- c("Renault Logan, 2005", "Складское помещение, 345 м²", "Су-шеф", "3-к квартира, 64 м², 3/5 эт.", "Samsung galaxy S4 mini GT-I9190 (чёрный)")

Create a VectorSource:

> vs <- VectorSource(data)
> vs # outputs correctly

Then, create the corpus:

> corp <- Corpus(vs)
> inspect(corp) # output is not encoded properly

The output that I get is:

> inspect(corp) <
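A sketch of a commonly suggested workaround (an assumption, not the accepted answer): construct a VCorpus instead of a Corpus, since the simple corpus that Corpus() builds can mangle non-ASCII text, particularly on Windows.

library(tm)

corp <- VCorpus(VectorSource(data))
lapply(corp, as.character)   # the Russian strings should come back intact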