text-mining

How to write custom removePunctuation() function to better deal with Unicode chars?

非 Y 不嫁゛ posted on 2019-11-30 14:21:46
In the source code of the tm text-mining R package, in the file transform.R, the removePunctuation() function is currently defined as:

function(x, preserve_intra_word_dashes = FALSE)
{
    if (!preserve_intra_word_dashes)
        gsub("[[:punct:]]+", "", x)
    else {
        # Assume there are no ASCII 1 characters.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x)
        x <- gsub("[[:punct:]]+", "", x)
        gsub("\1", "-", x, fixed = TRUE)
    }
}

I need to parse and mine some abstracts from a science conference (fetched from their website as UTF-8). The abstracts contain some Unicode characters that need to be removed, particularly at
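One approach (a sketch only, not the answer given in this thread) is to swap the POSIX class [[:punct:]] for the Unicode property classes \p{P} (punctuation) and \p{S} (symbols), which gsub() supports with perl = TRUE. The function name below is made up for illustration:

removePunctuationUnicode <- function(x, preserve_intra_word_dashes = FALSE) {
    if (preserve_intra_word_dashes) {
        # Protect intra-word dashes with ASCII 1, as the original tm code does.
        x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x, perl = TRUE)
        x <- gsub("[\\p{P}\\p{S}]+", "", x, perl = TRUE)
        gsub("\1", "-", x, fixed = TRUE)
    } else {
        gsub("[\\p{P}\\p{S}]+", "", x, perl = TRUE)
    }
}

It could then be applied to a corpus with tm_map(corpus, content_transformer(removePunctuationUnicode)).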

Adding custom stopwords in R tm

假如想象 posted on 2019-11-30 11:46:50
I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

Answer: stopwords() just gives you a character vector of words, so simply combine your own words with it:

tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Reza Rahimi: Save your custom stop words in a CSV file (e.g. word.csv).

library(tm)
stopwords <- read.csv("word.csv", header = FALSE)
stopwords <- as.character(stopwords$V1)
stopwords <- c(stopwords, stopwords())

Then you can
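The excerpt is cut off here; presumably (this is an assumption, not the original answer's wording) the combined vector is then removed from the corpus along these lines:

# "corpus" is a placeholder name for the corpus being cleaned.
corpus <- tm_map(corpus, removeWords, stopwords)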

How to extract textual contents from a web page? [closed]

我怕爱的太早我们不能终老 posted on 2019-11-30 10:39:29
I'm developing an application in Java which can take textual information from different web pages and summarize it into one page. For example, suppose I have a news story covered on different web pages such as the Hindu, the Times of India, the Statesman, etc. Now my application is supposed to extract the important points from each of these pages and put them together as a single news item. The application is based on concepts of web content mining. As a beginner to this field, I can't understand where to start. I have gone through research papers which describe noise removal as the first step in building this application. So

Finding ngrams in R and comparing ngrams across corpora

有些话、适合烂在心里 posted on 2019-11-30 06:56:52
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement"). This is a two-step question, one regarding my code so far and one regarding how I should go on.

Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

library(tm)
library(RWeka)
a <- Corpus(DirSource("/mycorpora/1965"),
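For reference, a commonly used pattern for extracting word n-grams from a tm corpus with RWeka looks roughly like the sketch below; the tiny corpus and the n-gram range are placeholders, not taken from the question:

library(tm)
library(RWeka)

# Hypothetical stand-in documents.
corp <- VCorpus(VectorSource(c("struggle criticism transformation movement",
                               "great leap forward movement")))

# Tokenizer that produces 2- and 3-grams.
TwoThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 3))

tdm <- TermDocumentMatrix(corp, control = list(tokenize = TwoThreeGramTokenizer))
inspect(tdm)

Note that in recent tm versions a custom tokenizer is typically only honoured for VCorpus objects, not for the simple Corpus/SimpleCorpus variant.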

How to break conversation data into pairs of (Context , Response)

懵懂的女人 posted on 2019-11-30 06:48:11
I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions.

Figure 1 shows a sample conversation where the user's question is answered in the next conversation line, making it easy to extract the data: during the conversation, "hello" and "Our offices are located in NYC" should be suggested.

Figure 2 describes a conversation where the questions and answers are not in sync: during the conversation, "hello" and "Our offices are located in NYC" should be suggested.

Figure 3 describes a conversation

tm: read in data frame, keep text id's, construct DTM and join to other dataset

我与影子孤独终老i posted on 2019-11-30 05:22:11
I'm using the package tm. Say I have a data frame of 2 columns and 500 rows. The first column is an ID which is randomly generated and contains both characters and numbers, e.g. "txF87uyK". The second column is the actual text, e.g. "Today's weather is good. John went jogging. blah, blah,..." Now I want to create a document-term matrix from this data frame. My problem is that I want to keep the ID information, so that after I get the document-term matrix I can join it with another dataset in which each row holds other information (date, topic, sentiment) about a document and is identified by the document ID.
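In current versions of tm, one way to carry the IDs through (a sketch; it assumes the two columns can be renamed to doc_id and text, which DataframeSource expects) is:

library(tm)

# Hypothetical stand-in for the 500-row data frame described above.
df <- data.frame(doc_id = c("txF87uyK", "aB3kQ9zP"),
                 text   = c("Today's weather is good. John went jogging.",
                            "Another short document."),
                 stringsAsFactors = FALSE)

corp <- VCorpus(DataframeSource(df))   # doc_id becomes the document name
dtm  <- DocumentTermMatrix(corp)

# The DTM rows are now named by doc_id, so they can be joined to other
# per-document data, e.g. via merge() on a doc_id column.
dtm_df <- data.frame(doc_id = rownames(as.matrix(dtm)), as.matrix(dtm),
                     check.names = FALSE)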

How to recreate same DocumentTermMatrix with new (test) data

只谈情不闲聊 posted on 2019-11-30 05:14:14
Suppose I have text-based training data and testing data. To be more specific, I have two data sets - training and testing - and both of them have one column which contains text and is of interest for the job at hand. I used the tm package in R to process the text column in the training data set. After removing the white spaces, punctuation, and stop words, I stemmed the corpus and finally created a document-term matrix of 1-grams containing the frequency/count of the words in each document. I then took a pre-determined cut-off of, say, 50 and kept only those terms that have a count of greater
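A common way to build the test document-term matrix over exactly the training vocabulary is the dictionary argument of DocumentTermMatrix; the sketch below uses placeholder corpora and assumes the same preprocessing has already been applied to both sets:

library(tm)

# Placeholders for the processed training and test corpora.
train_corp <- VCorpus(VectorSource(c("good weather today", "weather was bad")))
test_corp  <- VCorpus(VectorSource(c("today the weather is unknown")))

train_dtm  <- DocumentTermMatrix(train_corp)
keep_terms <- Terms(train_dtm)   # in practice, keep only the terms above the cut-off

# Restrict the test DTM to the training vocabulary so the columns line up.
test_dtm <- DocumentTermMatrix(test_corp,
                               control = list(dictionary = keep_terms))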

R, merge multiple rows of text data frame into one cell

☆樱花仙子☆ posted on 2019-11-30 04:48:19
Question: I have a text data frame that looks like below.

> nrow(gettext.df)
[1] 3
> gettext.df
  gettext
1 hello,
2 Good to hear back from you.
3 I've currently written an application and I'm happy about it

I wanted to merge this text data into one cell (to do sentiment analysis), as below:

> gettext.df
  gettext
1 hello, Good to hear back from you. I've currently written an application and I'm happy about it

so I collapsed the cells using the code below:

paste(gettext.df, collapse = " ")

but it seems like it
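The excerpt is cut off, but a plausible issue (an assumption on my part, not the thread's answer) is that paste() is applied to the whole data frame rather than to its text column; collapsing the column itself gives the desired single string:

gettext.df <- data.frame(
  gettext = c("hello,",
              "Good to hear back from you.",
              "I've currently written an application and I'm happy about it"),
  stringsAsFactors = FALSE
)

# Collapse the column, not the data frame, into one string.
merged_text <- paste(gettext.df$gettext, collapse = " ")
merged_text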

Bytes vs Characters vs Words - which granularity for n-grams?

[亡魂溺海] posted on 2019-11-30 03:55:01
Question: At least 3 types of n-grams can be considered for representing text documents:

byte-level n-grams
character-level n-grams
word-level n-grams

It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation?

Answer 1:
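To make the typo-robustness point concrete, here is a small self-contained R illustration (not part of the original question or its answers): the two strings still share most of their character 3-grams even though one word token differs.

# Character n-grams of a single string (hypothetical helper for illustration).
char_ngrams <- function(s, n = 3) {
  substring(s, 1:(nchar(s) - n + 1), n:nchar(s))
}

a <- char_ngrams("Mary loves dogs")
b <- char_ngrams("Mary lpves dogs")

# Jaccard similarity of the two 3-gram sets stays high despite the typo,
# whereas the word sets differ in one out of three tokens.
length(intersect(a, b)) / length(union(a, b))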

R tm removeWords function not removing words

被刻印的时光 ゝ posted on 2019-11-30 03:21:51
Question: I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of their frequency. I use this list to identify the words I am not interested in and then try to create a new list with those words removed. However, the words remain in my dataset. I am wondering what I am doing wrong and why the words aren't being removed. I have included the full code below:

install.packages("rvest")
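The full code is cut off here, so the actual cause cannot be confirmed. As a general sketch, two frequent reasons removeWords() appears to do nothing are that the result of tm_map() is not reassigned and that the document-term matrix is not rebuilt afterwards (object and word names below are placeholders):

library(tm)

corp <- VCorpus(VectorSource(c("alpha beta gamma", "Beta delta")))
drop_words <- c("beta")

corp <- tm_map(corp, content_transformer(tolower))  # removeWords is case-sensitive
corp <- tm_map(corp, removeWords, drop_words)       # reassign the mapped corpus
dtm  <- DocumentTermMatrix(corp)                    # rebuild the DTM after removal
inspect(dtm)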