tm

remove duplicates from list based on semantic similarity/relatedness

Submitted by 萝らか妹 on 2019-11-30 10:10:23
R + tm: How do I de-duplicate items in a list based on semantic similarity?

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")

My expected result is c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, "bank", "banks", and "banking" should be reduced to the single term "bank". SnowballC-style stemming is not an option because I have to retain the flavor of newspaper styles from various countries. Any help or direction would be useful.

Answer: We could calculate the Levenshtein distance between words using adist and regroup them into clusters using hclust:

d <- adist(v)
rownames(d) <- v
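
For completeness, a minimal sketch of that adist/hclust approach taken to the end. The cutree height h = 3 is hand-tuned for this toy vector (it is the complete-linkage height at which "banking" joins "bank"/"banks"), not a general threshold:

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")
d <- adist(v)                 # pairwise Levenshtein distances
rownames(d) <- v
hc <- hclust(as.dist(d))      # hierarchical clustering on those distances
groups <- cutree(hc, h = 3)   # h = 3 merges bank/banks/banking here
tapply(v, groups, `[`, 1)     # keep the first term of each cluster
# "bank" "ford_suv" "toyota_suv" "nissan_suv"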

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Submitted by 允我心安 on 2019-11-30 07:31:15
Following the many guides to creating bigrams with the tm and RWeka packages, I was getting frustrated that only 1-grams were being returned in the tdm. Through much trial and error I discovered that the tokenizer works as intended with VCorpus but not with Corpus. BTW, I'm pretty sure this was working with Corpus about a month ago, but it is not now. R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions. I would appreciate any insight into why this won't work with Corpus, and whether others have the same problem.
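
The usual explanation (worth verifying against your tm version) is that since tm 0.7, Corpus() builds a SimpleCorpus for vector and directory sources, and SimpleCorpus ignores custom tokenizers, whereas VCorpus honors them. A minimal bigram sketch along those lines; the sample documents are made up:

library(tm)
library(RWeka)
docs <- c("the quick brown fox", "the lazy brown dog")   # made-up sample
corp <- VCorpus(VectorSource(docs))   # VCorpus, not Corpus/SimpleCorpus
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer))
inspect(tdm)   # rows are bigrams such as "brown fox"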

How to show corpus text in R tm package?

Submitted by 我与影子孤独终老i on 2019-11-30 06:59:46
Question: I'm completely new to R and the tm package, so please excuse my stupid question ;-) How can I display the text of a plain-text corpus in the R tm package? I've loaded a corpus of 323 plain-text files:

src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

But when I call corpus[[1]], I always get output like this instead of the document text itself:

<<PlainTextDocument>>
Metadata: 7
Content: chars: 144

(and similarly "Content: chars: 141", "Content: chars: 224", "Content: chars: 75", ... for the other documents).
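
A sketch of the usual accessors for getting at the text itself (the directory path is the one from the question):

library(tm)
src <- DirSource("Korpora/technologie")
corpus <- VCorpus(src)                 # VCorpus keeps full document objects
as.character(corpus[[1]])              # the text of the first document
writeLines(as.character(corpus[[1]]))  # the same text, printed nicely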

Finding ngrams in R and comparing ngrams across corpora

Submitted by 有些话、适合烂在心里 on 2019-11-30 06:56:52
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-step question: one regarding my code so far, and one regarding how I should go on. Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

library(tm)
library(RWeka)
a <- Corpus(DirSource("/mycorpora/1965"), …
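
As a starting point, a sketch of how step 1 is usually done with these two packages (the corpus path is from the question; everything else is an assumption). Running the same code over a second corpus and merging the two frequency vectors by bigram name would set up the cross-corpus comparison of step 2:

library(tm)
library(RWeka)
Bigrams <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
a <- VCorpus(DirSource("/mycorpora/1965"))
tdm <- TermDocumentMatrix(a, control = list(tokenize = Bigrams))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # total count per bigram
head(freq, 20)   # the 20 most frequent bigrams in the 1965 corpus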

tm: read in data frame, keep text IDs, construct DTM and join to other dataset

Submitted by 我与影子孤独终老i on 2019-11-30 05:22:11
I'm using the tm package. Say I have a data frame of 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both characters and numbers, e.g. "txF87uyK". The second column is the actual text, e.g. "Today's weather is good. John went jogging. blah, blah, ...". Now I want to create a document-term matrix from this data frame. My problem is that I want to keep the ID information, so that after I have the document-term matrix I can join it with another matrix in which each row holds other information (date, topic, sentiment) about a document and is identified by the document ID.
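
A sketch using DataframeSource (tm 0.7+), which carries the IDs through: the data frame must have columns named doc_id and text. The sample rows here are made up:

library(tm)
df <- data.frame(doc_id = c("txF87uyK", "aB9kQ2mZ"),
                 text = c("Today's weather is good. John went jogging.",
                          "Another short document."),
                 stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(df))
dtm <- DocumentTermMatrix(corpus)
Docs(dtm)   # "txF87uyK" "aB9kQ2mZ" -- the DTM rows keep the IDs,
            # so the matrix can be joined to other per-document data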

How to recreate same DocumentTermMatrix with new (test) data

Submitted by 只谈情不闲聊 on 2019-11-30 05:14:14
Suppose I have text-based training and testing data. To be more specific, I have two data sets, training and testing, and both of them have one column which contains text and is of interest for the job at hand. I used the tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms whose count exceeded that cut-off.
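
A common approach (a sketch, not necessarily the asker's final code) is to build the test DTM against the training vocabulary via the dictionary control, so both matrices share the same columns; train_corpus and test_corpus are placeholders, and the same cleaning/stemming must be applied to both beforehand:

library(tm)
dtm_train <- DocumentTermMatrix(train_corpus)
keep <- findFreqTerms(dtm_train, lowfreq = 50)     # the question's cut-off
dtm_train <- DocumentTermMatrix(train_corpus,
                                control = list(dictionary = keep))
dtm_test  <- DocumentTermMatrix(test_corpus,
                                control = list(dictionary = keep))
# dtm_train and dtm_test now have identical term columns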

Removing non-English text from Corpus in R using tm()

Submitted by 流过昼夜 on 2019-11-30 03:57:07
I am using tm() and wordcloud() for some basic data mining in R, but am running into difficulties because there are non-English characters in my dataset (even though I've tried to filter out other languages based on background variables). Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special satisfação Happy Sad Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language = "lat"))

This yields the warning message:

Warning message: In readLines(y, encoding = x$Encoding) : …
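
One common workaround (a sketch, not from the question) uses base R: iconv transliterates to ASCII and drops characters that cannot be mapped, which strips accents but keeps the words; the grepl variant drops whole tokens containing any non-ASCII character:

txt <- "Special satisfação Happy Sad Potential für"
iconv(txt, from = "UTF-8", to = "ASCII", sub = "")
# [1] "Special satisfao Happy Sad Potential fr"

tokens <- unlist(strsplit(txt, " "))
paste(tokens[!grepl("[^ -~]", tokens)], collapse = " ")  # printable-ASCII only
# [1] "Special Happy Sad Potential"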

R tm removeWords function not removing words

Submitted by 被刻印的时光 ゝ on 2019-11-30 03:21:51
Question: I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of their frequency. I use this list to identify the words I am not interested in and then try to create a new list with those words removed. However, the words remain in my dataset. I am wondering what I am doing wrong and why the words aren't being removed. I have included the full code below:

install.packages("rvest") …
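
A minimal removeWords sketch for reference; a usual pitfall is forgetting to reassign the result of tm_map, or rebuilding the frequency table from the old corpus instead of the cleaned one. The documents and word list here are placeholders:

library(tm)
corp <- VCorpus(VectorSource(c("the quick brown fox", "the lazy dog")))
unwanted <- c("quick", "lazy")
corp <- tm_map(corp, removeWords, unwanted)   # reassign the result!
as.character(corp[[1]])   # "the  brown fox" (removeWords leaves extra spaces)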

Error converting text to lowercase with tm_map(…, tolower)

Submitted by 与世无争的帅哥 on 2019-11-29 23:11:48
I tried using tm_map. It gave the following error. How can I get around this?

require(tm)
byword <- tm_map(byword, tolower)

Error in UseMethod("tm_map", x) :
  no applicable method for 'tm_map' applied to an object of class "character"

Answer: Use the base R function tolower():

tolower(c("THE quick BROWN fox"))
# [1] "the quick brown fox"

Answer (daroczig): Expanding my comment to a more detailed answer here: you have to wrap tolower inside content_transformer so as not to corrupt the VCorpus object; something like:

> library(tm)
> data('crude')
> crude[[1]]$content
[1] "Diamond Shamrock Corp said that …
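
The rest of that answer boils down to applying the wrapper; a minimal sketch using the same crude dataset that ships with tm:

library(tm)
data("crude")
crude <- tm_map(crude, content_transformer(tolower))  # safe lower-casing
substr(crude[[1]]$content, 1, 40)   # the same text, now lower-case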

Use R to convert PDF files to text files for text mining

Submitted by 允我心安 on 2019-11-29 22:24:21
I have nearly one thousand PDF journal articles in a folder. I need to text-mine all the articles' abstracts from the whole folder. Currently I am doing the following:

dest <- "~/A1.pdf"
# set path to pdftotext.exe and convert PDF to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)
# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

With this I am converting one PDF file to one .txt file, then copying the abstract into another .txt file and compiling it manually. This work is tedious.
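
A natural next step is to batch the same call over every PDF and read the results straight into tm; a sketch under the question's assumptions (xpdf's pdftotext installed at the path above; the "~/pdfs" folder is a placeholder):

exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
pdfs <- list.files("~/pdfs", pattern = "\\.pdf$", full.names = TRUE)
for (dest in pdfs) {
  # one .txt is written next to each .pdf
  system(paste0('"', exe, '" "', dest, '"'), wait = TRUE)
}
txts <- sub("\\.pdf$", ".txt", pdfs)
library(tm)
corpus <- Corpus(URISource(txts), readerControl = list(reader = readPlain))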