tm

tm readPDF: Error in file(con, "r") : cannot open the connection

Submitted by 给你一囗甜甜゛ on 2019-12-18 09:35:02
Question: I have tried the example code recommended in the tm::readPDF documentation:

    library(tm)
    if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
      uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
      pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = uri),
                                                   language = "en", id = "id1")
      pdf[1:13]
    }

But I get the following error (which occurs after calling the function returned by readPDF):

    Error in file(con, "r") : cannot open the connection
    In addition: Warning message
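A common cause of this error is that pdftotext is missing or not on the PATH, so the external tool never produces the text file that readPDF then tries to open. As a hedged sketch (not from the question), one can check Sys.which() explicitly and use the control argument of current tm versions, which replaced PdftotextOptions:

    library(tm)
    # Sys.which() returns "" when a tool is not found; nzchar() makes the
    # check explicit. engine = "xpdf" uses the pdfinfo/pdftotext binaries.
    if (all(nzchar(Sys.which(c("pdfinfo", "pdftotext"))))) {
      uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
      reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))
      pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
      content(pdf)[1:13]
    }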

stemDocument in tm package not working on past-tense words

Submitted by 别来无恙 on 2019-12-18 09:13:45
Question: I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried stemDocument in the tm package, as follows, but only get "said say say make made". Is there a way to stem past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!

    filename <- 'check_text.txt'
    con <- file(filename, "rb")
    text_data <- readLines(con, skipNul = TRUE)
    close(con)
    text_VS <-
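Porter stemming, which stemDocument wraps, only strips suffixes, so irregular forms like "said" and "made" pass through unchanged. Lemmatization is the usual remedy. A hedged sketch using the textstem package (not mentioned in the question):

    # Dictionary-based lemmatization maps irregular inflections to lemmas.
    library(textstem)
    words <- c("said", "say", "says", "make", "made")
    lemmatize_words(words)
    # expected: "say" "say" "say" "make" "make"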

bind a character vector and a list into a data frame

Submitted by 只愿长相守 on 2019-12-14 03:18:30
Question: I have a list of URLs and have extracted the content as follows:

    library(httr)
    library(stringr)
    link <- "http://www.workerspower.net/disposable-workers-the-real-price-of-sweat-shop-labor"
    get.link <- GET(link)
    get.content <- content(get.link, as = "text")
    extract.content <- str_extract_all(get.content, "<p>(.*?)</p>")

This gives a "list of 1" with text. The length of each list varies with the URL. I would like to bind the URL [link] with the content [extract.content], transform it into a data frame, and then import that
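A hedged sketch of the binding step being asked about, assuming extract.content is the list of <p>...</p> matches from above; data.frame() recycles the single URL across all extracted paragraphs:

    paragraphs <- unlist(extract.content)
    df <- data.frame(url = link,
                     content = paragraphs,
                     stringsAsFactors = FALSE)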

no applicable method for 'tm_map' applied to an object of class "character"

Submitted by 半世苍凉 on 2019-12-13 13:11:55
Question: My data looks like this:

    1. Good quality, love the taste, the only ramen noodles we buy but they're available at the local Korean grocery store for a bit less so no need to buy on Amazon really.
    2. Great flavor and taste. Prompt delivery. We will reorder this and other products from this manufacturer.
    3. Doesn't taste good to me.
    4. Most delicious ramen I have ever had. Spicy and tasty. Great price too.
    5. I have this on my subscription, my family loves this version. The taste is great by
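The error in the title means tm_map() was called on a plain character vector; it only has methods for corpus objects. A hedged sketch of the usual fix, with a hypothetical `reviews` vector standing in for the asker's data:

    library(tm)
    reviews <- c("Good quality, love the taste.",
                 "Great flavor and taste. Prompt delivery.")
    corp <- VCorpus(VectorSource(reviews))   # wrap the vector in a corpus
    corp <- tm_map(corp, content_transformer(tolower))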

Using R for Text Mining Reuters-21578

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-13 06:24:36
Question: I am trying to do some work with the well-known Reuters-21578 dataset and am having trouble loading the .sgm files into my corpus. Right now I am using:

    require(tm)
    reut21578 <- system.file("reuters21578", package = "tm")
    reuters <- Corpus(DirSource(reut21578),
                      readerControl = list(reader = readReut21578XML))

in an attempt to include all the files in my corpus, but this gives me the following error:

    Error in DirSource(reut21578) : empty directory

Any idea where I may be
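system.file() returns an empty string when the requested path does not exist in the installed package, and DirSource("") then reports an empty directory; recent tm releases do not ship the Reuters-21578 files, so they may need to be obtained separately. A hedged sketch of how to narrow this down:

    reut21578 <- system.file("reuters21578", package = "tm")
    reut21578        # "" means the directory is not part of the installed tm
    dir(reut21578)   # should list the SGML/XML files if the path is valid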

how to set author for each doc in a corpus by parsing doc ID

Submitted by 懵懂的女人 on 2019-12-13 04:24:21
Question: I have a tm Corpus object like this:

    > summary(corp.eng)
    A corpus with 154 text documents
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator
    Available variables in the data frame are:
      MetaID

The metadata for each document in the corpus looks like this:

    > meta(corp.eng[[1]])
    Available meta data pairs are:
      Author       :
      DateTimeStamp: 2013-04-18 14:37:24
      Description  :
      Heading      :
      ID           : Smith-John_e.txt
      Language     : en_CA
      Origin       :

I know that I can set the
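A hedged sketch of one way to do what the title asks, assuming IDs follow the "Smith-John_e.txt" pattern shown above; note the meta tag names ("Author"/"ID" vs. "author"/"id") vary across tm versions:

    for (i in seq_along(corp.eng)) {
      id <- meta(corp.eng[[i]], "ID")               # e.g. "Smith-John_e.txt"
      meta(corp.eng[[i]], "Author") <- sub("_e\\.txt$", "", id)
    }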

R: remove sparse terms by document type

Submitted by 老子叫甜甜 on 2019-12-13 02:36:31
Question: I'm new to corpora. I have a big corpus containing four types of documents, and I want to remove sparse terms within each type. I can't simply create separate corpora because they have already gone through many transformations; following some posts, I created a TermDocumentMatrix with the type name in each column, but I can't find a way to remove sparse terms by type. Any ideas? Thank you! Just for example, I removed sparse terms for the whole corpus:

    TDM_1 <- removeSparseTerms(TDM, 0.98)
    inspect(TDM_1) <
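A hedged sketch of a per-type approach, assuming (as the question says) that the TDM column names carry the document type; subsetting a TermDocumentMatrix by columns yields a smaller TDM that removeSparseTerms() accepts. The type labels here are hypothetical:

    types <- c("type1", "type2", "type3", "type4")   # hypothetical labels
    tdm_by_type <- lapply(types, function(ty) {
      sub_tdm <- TDM[, grepl(ty, colnames(TDM), fixed = TRUE)]
      removeSparseTerms(sub_tdm, 0.98)
    })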

Extract n Words Around Defined Term (Multicase)

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-13 02:04:09
Question: I have a vector of text strings, such as:

    Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn't good enough. Let me help you with your baggage.",
                   "Everyone was busy, so I went to the movie alone. Two seats were vacant.",
                   "TEST Rock music approaches at high velocity.",
                   "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
                   "A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
                   "Rock
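A hedged sketch (not taken from an answer) of one base-R way to pull the n words on either side of every occurrence of the term, handling multiple matches per string:

    extract_around <- function(x, term = "TEST", n = 3) {
      words <- unlist(strsplit(x, "\\s+"))       # split on whitespace
      hits <- which(words == term)               # every exact match
      lapply(hits, function(i)
        words[max(1, i - n):min(length(words), i + n)])
    }
    extract_around(Sentences[1])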

How to filter meta data by user-defined statements in R?

Submitted by 一曲冷凌霜 on 2019-12-13 01:27:05
Question: There is a function called sFilter in R to filter metadata, but it belongs to an old version (0.5-10) of the tm package. Is there a replacement in the current version? My code was:

    query <- "LEWISSPLIT == 'TRAIN'"
    trainData <- tm_filter(Corpus, FUN = sFilter, query)

It means: get the documents whose LEWISSPLIT attribute has the value "TRAIN".

    <REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>

Answer 1: Just write your own filtering function:

    trainData <- tm_filter
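A hedged completion of the answer's idea (the answer text is cut off above), assuming LEWISSPLIT was stored in each document's local metadata when the corpus was read:

    trainData <- tm_filter(Corpus, FUN = function(doc) {
      meta(doc, "LEWISSPLIT") == "TRAIN"
    })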

Text mining with tm.plugin.webmining package using GoogleFinanceSource function

Submitted by 岁酱吖の on 2019-12-12 18:49:07
Question: I am studying text mining with the online book http://tidytextmining.com/. The fifth chapter, http://tidytextmining.com/dtm.html#financial, has the following code:

    library(tm.plugin.webmining)
    library(purrr)

    company <- c("Microsoft", "Apple", "Google", "Amazon", "Facebook",
                 "Twitter", "IBM", "Yahoo", "Netflix")
    symbol <- c("MSFT", "AAPL", "GOOG", "AMZN", "FB",
                "TWTR", "IBM", "YHOO", "NFLX")

    download_articles <- function(symbol) {
      WebCorpus(GoogleFinanceSource(paste0("NASDAQ:", symbol)))
    }

    stock
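For context, a hedged sketch of how that chapter applies download_articles() to each ticker (reconstructed from the pattern above, not quoted verbatim); note that GoogleFinanceSource depends on a Google Finance service that has since been discontinued, so this code may no longer retrieve articles:

    library(dplyr)
    stock_articles <- tibble(company = company, symbol = symbol) %>%
      mutate(corpus = map(symbol, download_articles))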