text-mining

r : Need content_transformer() called by tm_map() to change non-letters to spaces

Submitted by 假如想象 on 2019-12-11 15:06:50
Question: (This question was migrated from Cross Validated because it can be answered on Stack Overflow.) In the following code, any character matching the pattern "/|@| \\|" is changed to a space:

> library(tm)
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> docs <- tm_map(docs, toSpace, "/|@| \\|")

What code would transform all non-letters to a space? (What goes where the xxxxx's are below?) It is very difficult to put all non-letters in a string... (Very
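A minimal sketch of one possible answer, reusing the toSpace transformer from the question (assumption: "non-letters" means everything outside a-z/A-Z): the regex character class [^a-zA-Z] matches any non-letter, so it can be passed as the pattern directly.

```r
library(tm)

# Toy corpus standing in for the asker's docs
docs <- VCorpus(VectorSource("R/text@mining 2019!"))

# Same transformer as in the question
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# [^a-zA-Z] = "any character that is not a letter"
docs <- tm_map(docs, toSpace, "[^a-zA-Z]")
content(docs[[1]])  # every digit and punctuation mark is now a space
```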

Assigning weights to different features in R

Submitted by 空扰寡人 on 2019-12-11 13:09:33
Question: Is it possible to assign weights to different features before formulating a DFM in R? Consider this example:

str = "apple is better than banana"
mydfm = dfm(str, ignoredFeatures = stopwords("english"), verbose = FALSE)

The DFM mydfm looks like:

       apple better banana
text1      1      1      1

But I want to assign weights (apple: 5, banana: 3) beforehand, so that mydfm looks like:

       apple better banana
text1      5      1      3

Answer 1: I don't think so; however, you can easily do it afterwards: library(quanteda) str <-
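A sketch of the "do it afterwards" idea, since the original answer is truncated (this is my reconstruction, not the answerer's code; it assumes a current quanteda, where tokens()/tokens_remove() replace the deprecated ignoredFeatures argument):

```r
library(quanteda)

str   <- "apple is better than banana"
toks  <- tokens_remove(tokens(str), stopwords("english"))
mydfm <- dfm(toks)

# Scale the chosen features after building the DFM
m <- as.matrix(mydfm)
w <- c(apple = 5, banana = 3)
m[, names(w)] <- sweep(m[, names(w), drop = FALSE], 2, w, "*")
m  # apple = 5, better = 1, banana = 3
```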

Using Naive Bayes Classification to Identify a Twitter User's Gender [closed]

Submitted by 99封情书 on 2019-12-11 12:16:19
Question: I have become part of a project at school that has been a lot of fun so far, and it just got a little more interesting. I have roughly 600,000 tweets in my possession (each contains screen name, geolocation, text, etc.) and my goal is to try to classify each user as either
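A hedged sketch of the general recipe (the toy tweets and labels below are mine, not the asker's 600,000): turn tweet text into presence/absence features and fit e1071's naiveBayes.

```r
library(tm)
library(e1071)

# Hypothetical labeled tweets
tweets <- c("loving this new playlist", "match highlights were great",
            "new playlist is amazing", "great match last night")
gender <- factor(c("female", "male", "female", "male"))

dtm <- as.matrix(DocumentTermMatrix(VCorpus(VectorSource(tweets))))

# Presence/absence factors suit naiveBayes better than raw counts
feats <- as.data.frame(lapply(as.data.frame(dtm), function(x)
  factor(x > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))))

model <- naiveBayes(feats, gender, laplace = 1)
predict(model, feats)
```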

What is the correct use of stemDocument?

Submitted by 微笑、不失礼 on 2019-12-11 11:58:45
Question: I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Let's follow this example:

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt", load = TRUE))
lapply(q17, content)
$`character(0)`
[1] "poder"
$`character(0)`
[1] "pode"

If I use:

> stemDocument("poder", language = "portuguese")
[1] "pod"
> stemDocument("pode", language = "portuguese")
[1] "pod"

it does work! But if I use:

> q17 <- tm_map(q17,
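For context, the usual tm idiom, sketched on the question's own corpus: tm_map() forwards extra arguments to the function it applies, so the language can be passed straight through.

```r
library(tm)

q17 <- VCorpus(VectorSource(x = c("poder", "pode")),
               readerControl = list(language = "pt", load = TRUE))

# Extra arguments to tm_map() are passed on to stemDocument()
q17 <- tm_map(q17, stemDocument, language = "portuguese")
lapply(q17, content)  # both documents stem to "pod"
```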

Count number of times a word-wildcard appears in text (in R)

Submitted by 邮差的信 on 2019-12-11 11:55:07
Question: I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:

1) Count the number of times each regular word appears in a given text (i.e., if "activated" appears in the text, the "activated" frequency would be 1).
2) Count the number of times each wildcard word appears in the text (i.e., if "activated" and "activation" both appear in the text, the "activat*" frequency would be 2).

I'm able to achieve (1), but not (2). Can anyone please help? Thanks.

library(tm)
library(qdap)
text <-
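One sketch for (2), assuming per-token matching is acceptable: utils::glob2rx() converts a wildcard like "activat*" into an anchored regex, which grepl() can then count over the tokens.

```r
text  <- "activated and activation, but not deactivate"
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[words != ""]

pat <- glob2rx("activat*")   # becomes "^activat"
sum(grepl(pat, words))       # 2: "activated" and "activation"
```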

R Warning in stemCompletion and error in TermDocumentMatrix

Submitted by 有些话、适合烂在心里 on 2019-12-11 11:19:18
Question: I was following the instructions from here. On slide no. 9, tolower has an issue in package tm 0.6 and above, so I used

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

(this duplicates this Stack Overflow question), but I still get an error when I run stemCompletion:

myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)

I also followed this instruction to convert both myCorpus and myCorpusCopy to PlainTextDocument:

corpus <- tm_map(corpus, PlainTextDocument)

I was able to execute
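A common workaround for tm >= 0.6, sketched on a toy corpus (the wrapper name stemCompletion2 is my own): stemCompletion() expects a character vector of stems rather than a whole document, so split each document into words first and reassemble it afterwards.

```r
library(tm)

myCorpusCopy <- VCorpus(VectorSource("running runner runs"))
myCorpus <- tm_map(myCorpusCopy, stemDocument)

# Hypothetical helper: complete word-by-word, then rebuild the document
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""]
  x <- stemCompletion(x, dictionary = dictionary)
  PlainTextDocument(paste(x, collapse = " "))
}

myCorpus <- tm_map(myCorpus, stemCompletion2, dictionary = myCorpusCopy)
content(myCorpus[[1]])
```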

Text Mining with R

Submitted by 半城伤御伤魂 on 2019-12-11 08:34:29
Question: I need help with text mining in R.

Title    Date           Content
Boy      May 13 2015    "She is pretty", Tom said. Tom is handsome.
Animal   June 14 2015   The penguin is cute, lion added.
Human    March 09 2015  Mr Koh predicted that every human is smart...
Monster  Jan 22 2015    Ms May, a student, said that John has $10.80. May loves you.

I just want to get the opinions from what the people said. I would also like help getting the percentage (e.g., 9.8%), because when I split the sentences based
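If "opinions" means the quoted speech, a minimal sketch (that reading is my assumption) extracts the text between double quotes with a regex:

```r
x <- '"She is pretty", Tom said. Tom is handsome.'
regmatches(x, gregexpr('"[^"]*"', x))[[1]]
# [1] "\"She is pretty\""
```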

'Word2Vec' object has no attribute 'index2word'

Submitted by 不羁的心 on 2019-12-11 07:05:12
Question: I'm getting the error "AttributeError: 'Word2Vec' object has no attribute 'index2word'" in the following Python code. Does anyone know how I can solve it? Actually, "tfidf_weighted_averaged_word_vectorizer" throws the error. "obli.csv" contains lines of sentences. Thank you.

from feature_extractors import tfidf_weighted_averaged_word_vectorizer
dataset = get_data2()
corpus, labels = dataset.data, dataset.target
corpus, labels = remove_empty_docs(corpus, labels)
# print('Actual class label:',

What does sentiwordnet 3.0 result signify?

Submitted by 随声附和 on 2019-12-11 05:39:24
Question: What does the result of SentiWordNet signify? If the value given for "good" is 0.6337, does it mean the probability that the word "good" is positive is 0.6337, or does it mean the word "good" has a weight of 0.6337? If it is a weight, then the value of "extraordinary" should be greater than that of "good", but the value given to "extraordinary" is only 0.272727. The format of SentiWordNet is:

POS ID PosScore NegScore SynsetTerms Gloss

How exactly is the final result calculated? (using the demo code http:
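For context, one widely used way to aggregate a word's senses into a single score, sketched in R (assumption on my part: the demo weights each sense by 1/rank; the scores below are made up, not real SentiWordNet entries):

```r
# Hypothetical PosScore/NegScore per sense, in rank order
pos <- c(0.75, 0.5, 0.25)
neg <- c(0.0, 0.125, 0.0)

w <- 1 / seq_along(pos)        # weight sense k by 1/k
sum(w * (pos - neg)) / sum(w)  # weighted net score, in [-1, 1]
```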

read multiple text files from multiple folders

Submitted by 馋奶兔 on 2019-12-11 04:32:48
Question: I'm trying to read all the '*.txt' files in the subfolders, but there seems to be a problem in the loop. Basically, the folders are structured as follows:

branch1  branch2  txt.file  result I want
1        2002     a         a
         2003     b         b+c
                  c
2        2004     d         d
         2005     e         e+f
                  f

So, I've been listing directories into a list, like below:

setwd("C:/Users/J/Desktop/research/DATA
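The nested loop can usually be avoided entirely (a sketch; the asker's path is truncated in the question, so a placeholder directory is used here): list.files() with recursive = TRUE finds every *.txt under all subfolders at once.

```r
# "research" is a placeholder for the asker's truncated directory
files <- list.files("C:/Users/J/Desktop/research",
                    pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
texts <- lapply(files, readLines)
names(texts) <- files
```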