tm

Plot the evolution of an LDA topic across time

元气小坏坏 submitted on 2020-01-13 05:59:29

Question: I'd like to plot how the proportion of a particular topic changes over time, but I've been having trouble isolating a single topic and plotting it over time, especially when plotting multiple groups of documents separately (let's create two groups to compare: journals A and B). I've saved the dates associated with these journals in a function called dateConverter . Here's what I have so far (with much thanks to @scoa): library(tm); library(topicmodels); txtfolder <- "~/path/to/documents/"
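One way to sketch the plotting step: assuming an already-fitted model `lda_model` from topicmodels, plus hypothetical per-document vectors `doc_dates` (Date) and `doc_journal` (journal A/B labels), the per-document topic proportions can be pulled from `posterior()` and averaged by year and journal:

```r
library(topicmodels)
library(ggplot2)

# `lda_model`, `doc_dates`, and `doc_journal` are assumed to exist;
# posterior() gives a documents-x-topics matrix of proportions.
topic_props <- posterior(lda_model)$topics
k <- 3  # the topic whose evolution we want to track

df <- data.frame(date    = doc_dates,
                 journal = doc_journal,
                 prop    = topic_props[, k])

# Mean proportion of topic k per year, separately for each journal
agg <- aggregate(prop ~ format(date, "%Y") + journal, data = df, FUN = mean)
names(agg)[1] <- "year"

# One line per journal, year on the x-axis
ggplot(agg, aes(year, prop, colour = journal, group = journal)) +
  geom_line()
```

This is only a sketch under those naming assumptions; the aggregation granularity (year, month, issue) depends on what dateConverter returns.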

Remove all punctuation from text including apostrophes for tm package

谁说我不能喝 submitted on 2020-01-11 11:26:30

Question: I have a vector of Tweets (just the message text) that I am cleaning for text-mining purposes. I used removePunctuation from the tm package like so: clean_tweet_text = removePunctuation(tweet_text) This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate , but if a tweet has 'climate it won't be counted. How can
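A likely culprit with Tweet text is that the surviving "apostrophes" are typographic (curly) quotes such as U+2019, which `removePunctuation()` does not strip in its default configuration. A minimal sketch, normalising both straight and curly quotes before the tm call (`tweet_text` stands in for the asker's vector):

```r
library(tm)

tweet_text <- c("\u2019climate\u2019 change is real", "don't panic")

# Strip straight apostrophes, curly quotes, and backticks first,
# then let removePunctuation() handle the ASCII punctuation.
clean_tweet_text <- gsub("[\u2018\u2019'`]", "", tweet_text)
clean_tweet_text <- removePunctuation(clean_tweet_text)
```

Alternatively, newer tm versions accept `removePunctuation(x, ucp = TRUE)` to match Unicode punctuation classes, which covers curly quotes directly.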

How Do I Parse a Date Time String That Includes Fractional Time?

匆匆过客 submitted on 2020-01-10 05:11:10

Question: I have a date-time string: 20:48:01.469 UTC MAR 31 2016 I would like to convert this string representation of time to a struct tm using strptime , but my format string isn't working. Is there a format specifier for fractional seconds? Perhaps %S , %s , or something else? The code snippet is below: tm tmbuf; const char *str = "20:48:01.469 UTC MAR 31 2016"; const char *fmt = "%H:%M:%s %Z %b %d %Y"; strptime(str,fmt,&tmbuf); Answer 1: Using this free, open-source C++11/14 library, here is another way to

Subset a corpus by meta data?

那年仲夏 submitted on 2020-01-05 10:09:33

Question: I feel like this should be easier, but I cannot figure it out. How do I filter out documents from a corpus based on metadata? To be more specific, I have a corpus of 576 documents, each of which has the tag 'Section'. Section takes a number of different values, such as "News", "Editorial", and "Comment". How do I use tm_filter to filter out documents that have "Editorial" and/or "Comment" in this tag? I'm sorry I haven't provided reproducible data. I don't really know how to go about
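A minimal reproducible sketch of the tm_filter approach, using a toy three-document corpus in place of the asker's 576 documents: tm_filter keeps documents for which the predicate returns TRUE, so excluding sections means negating the membership test.

```r
library(tm)

# Toy corpus with a "Section" tag on each document (hypothetical data)
docs <- VCorpus(VectorSource(c("a news story", "an opinion piece", "a reader comment")))
meta(docs[[1]], "Section") <- "News"
meta(docs[[2]], "Section") <- "Editorial"
meta(docs[[3]], "Section") <- "Comment"

# Keep only documents whose Section is neither Editorial nor Comment
kept <- tm_filter(docs, FUN = function(d)
  !(meta(d, "Section") %in% c("Editorial", "Comment")))

length(kept)  # 1 document left in this toy example
```

Dropping the `!` inverts the filter, i.e. keeps only the Editorial/Comment documents instead.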

Find frequency of a custom word in R TermDocumentMatrix using TM package

喜欢而已 submitted on 2020-01-05 04:28:10

Question: I turned about 50,000 rows of varchar data into a corpus, then cleaned that corpus using the TM package, getting rid of stopwords, punctuation, and numbers. I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most common words and the number of times each appears in the data. However, I want to use a function that says search for "word" and return how many times "word" appears in
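One way to sketch this: a TermDocumentMatrix has one row per term, so the count for a single chosen word is just the sum of that term's row. A toy corpus stands in for the asker's 50,000 rows:

```r
library(tm)

corp <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))
tdm  <- TermDocumentMatrix(corp)

# Total occurrences of one chosen term across all documents.
# Guard against terms that never survived cleaning.
word_count <- function(tdm, word) {
  if (!word %in% rownames(tdm)) return(0)
  sum(as.matrix(tdm[word, ]))
}

word_count(tdm, "apple")  # 2 in this toy corpus
```

Because only a single row is densified, this stays cheap even for large matrices; `slam::row_sums(tdm)[word]` is an equivalent fully-sparse alternative.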

Why does as.matrix result in memory overload while running text mining in R?

扶醉桌前 submitted on 2020-01-04 20:45:43

Question: I am doing a text analysis with the R package tm. My code is based on this link: https://www.r-bloggers.com/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know/ The text files I load are only 4800 kB; they are a 10% sample of the original files I want to analyze. My code is: library(tm) library(wordcloud) library(SnowballC) library(textmineR) library(RWeka) blogssub <- readLines("10kblogs.txt") newssub <- readLines("10knews.txt") tweetssub <- readLines(
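The likely cause: tm stores a document-term matrix sparsely (as a slam simple_triplet_matrix), but as.matrix() materialises every zero cell of the full terms-by-documents grid, which for a vocabulary of tens of thousands of terms dwarfs the 4800 kB of input text. Term frequencies can be computed without ever densifying; a minimal sketch:

```r
library(tm)
library(slam)

corp <- VCorpus(VectorSource(c("big data needs big memory",
                               "sparse matrices save memory")))
dtm  <- DocumentTermMatrix(corp)

# as.matrix(dtm) would allocate a dense docs-x-terms matrix, which is
# what overloads memory on real corpora. Sparse column sums avoid that:
freqs <- slam::col_sums(dtm)
sort(freqs, decreasing = TRUE)
```

For word clouds specifically, these sparse frequencies can be fed straight to wordcloud() without the as.matrix step the linked tutorial uses.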

Text mining pdf files/issues with word frequencies

三世轮回 submitted on 2020-01-04 03:59:04

Question: I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents I get high-frequency words such as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but I get these random Greek letters with others. Is this a problem with character encoding? (By the way, all the documents are in English.) Any suggestions? # Here is the link to pdf file for testing # www.sciencedirect.com/science/article/pii/S0164121212000532 library(tm
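A plausible explanation: equations and figure labels in such PDFs use symbol fonts, so the extractor emits glyph names like "phi" and "sigma" as if they were words, and some embedded fonts map to unusable codepoints altogether. One pragmatic workaround (a sketch; the file path is hypothetical) is to extract with pdftools and discard unmappable characters before building the corpus:

```r
library(pdftools)
library(tm)

# Extract page text; pdftools tends to handle font encodings more
# gracefully than tm's xpdf-based readPDF reader.
txt <- pdf_text("article.pdf")

# Replace characters that cannot be mapped to ASCII with spaces,
# which drops most symbol-font residue from equations.
txt <- iconv(txt, from = "UTF-8", to = "ASCII", sub = " ")

corp <- VCorpus(VectorSource(txt))
```

This does not recover the equations, it only keeps their residue out of the frequency counts; stray tokens like "taeoe" may additionally need a custom stopword list.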

Treat words separated by space in the same manner

送分小仙女□ submitted on 2020-01-03 07:30:21

Question: I am trying to find words that occur in multiple documents at the same time. Let us take an example. doc1: "this is a document about milkyway" doc2: "milky way is huge" As you can see in the two documents above, the word "milkyway" occurs in both docs, but in the second document the term "milkyway" is separated by a space, while in the first doc it is not. I am doing the following to get the document-term matrix in R. library(tm) tmp.text <- data.frame(rbind(doc1, doc2)) tmp.corpus <- Corpus
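One common approach is to normalise the spaced spelling to a single token before building the matrix, so both documents contribute to the same "milkyway" column. A minimal sketch with the two example documents:

```r
library(tm)

doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"

# Collapse "milky way" (any amount of whitespace) into "milkyway"
# before tokenisation, so the two spellings count as one term.
texts <- gsub("milky\\s+way", "milkyway", c(doc1, doc2), ignore.case = TRUE)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))
inspect(dtm[, "milkyway"])  # now nonzero in both documents
```

This only handles variants you enumerate; for a general solution, bigram tokenisation or a dictionary of multiword terms would be needed.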

Split delimited strings into distinct columns in R dataframe

拜拜、爱过 submitted on 2020-01-01 19:25:14

Question: I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame: data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") ) (please note the different delimiters among columns) The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternatives). I need two data frames like these: tok1
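A base-R sketch of the split: strsplit handles a per-column delimiter, and rows shorter than the longest are padded with NA so the result is rectangular. The helper name `split_col` is mine, not from the question:

```r
data <- data.frame(id   = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
                   stringsAsFactors = FALSE)

# Split one delimited column into as many columns as the longest
# entry needs, padding shorter rows with NA.
split_col <- function(x, sep) {
  parts <- strsplit(x, sep)
  n <- max(lengths(parts))
  out <- t(vapply(parts, function(p) c(p, rep(NA, n - length(p))),
                  character(n)))
  as.data.frame(out, stringsAsFactors = FALSE)
}

tok1_df <- cbind(id = data$id, split_col(data$tok1, ",\\s*"))  # comma-separated
tok2_df <- cbind(id = data$id, split_col(data$tok2, "\\|"))    # pipe-separated
```

tidyr's `separate()` does the same per column when the number of pieces is known; the helper above sidesteps having to know it in advance.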

How to extract sentences containing specific person names using R

半城伤御伤魂 submitted on 2020-01-01 09:33:08

Question: I am using R to extract sentences containing specific person names from texts, and here is a sample paragraph: Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin. Melanchthon became professor of the Greek language in Wittenberg at the age of 21. He studied the Scripture, especially of Paul, and Evangelical doctrine. He was present at the disputation of Leipzig (1519) as a spectator, but
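A minimal base-R sketch for the paragraph above: split into sentences on terminal punctuation, then keep sentences mentioning the target name. This naive splitter mishandles abbreviations; a dedicated tokenizer (e.g. tokenizers::tokenize_sentences) is more robust.

```r
paragraph <- paste("Opposed as a reformer at Tübingen, he accepted a call to",
                   "the University of Wittenberg by Martin Luther, recommended",
                   "by his great-uncle Johann Reuchlin. Melanchthon became",
                   "professor of the Greek language in Wittenberg at the age of 21.")

# Split after ., !, or ? followed by whitespace (lookbehind keeps the
# punctuation attached to its sentence).
sentences <- unlist(strsplit(paragraph, "(?<=[.!?])\\s+", perl = TRUE))

# Sentences mentioning a given person
grep("Martin Luther", sentences, value = TRUE)
```

For name detection beyond exact string matches (e.g. catching "Luther" or "Melanchthon" as person entities), an NER package such as openNLP would be the next step.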