Quotes and hyphens not removed by tm package functions while cleaning corpus


Question


I'm trying to clean a corpus and have used the typical steps, as in the code below:

library(tm)

docs <- Corpus(DirSource(path))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords('en'))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, a few terms still carry quotes or a leading hyphen, for example: “we” “company” “code guidelines” -known -accelerated

It seems the words themselves are wrapped in the quotes, but running the removePunctuation step again doesn't remove them. There are also some words with bullet characters in front of them that I can't remove either.

Any help would be greatly appreciated.


Answer 1:


removePunctuation uses gsub('[[:punct:]]', '', x), i.e. it removes the ASCII punctuation symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. To remove other symbols, such as typographic (curly) quotes or bullet signs, declare your own transformation function:

removeSpecialChars <- function(x) gsub("[“”•]", "", x)  # character class: each symbol matched individually
docs <- tm_map(docs, removeSpecialChars)

Or you can go further and remove everything that is not an alphanumeric character or a space:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
docs <- tm_map(docs, removeSpecialChars)
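
Note that [^a-zA-Z0-9 ] also strips accented and other non-ASCII letters. If those should be kept, a POSIX-class variant works the same way (my own sketch, not part of the original answer; for tm 0.6+ wrap it in content_transformer() as shown in the third answer below):

removeSpecialChars <- function(x) gsub("[^[:alnum:][:space:]]", "", x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))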



Answer 2:


A better-constructed tokenizer will handle this automatically. Try this:

> require(quanteda)
> text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
> toktexts <- tokenize(toLower(text), removePunct = TRUE, removeNumbers = TRUE)
> toktexts
[[1]]
[1] "enjoying" "my"       "time"    

[[2]]
[1] "single" "air"    "quotes"

attr(,"class")
[1] "tokenizedTexts" "list"          
> dfm(toktexts, stem = TRUE, ignoredFeatures = stopwords("english"), verbose = FALSE)
Creating a dfm from a tokenizedTexts object ...
   ... indexing 2 documents
   ... shaping tokens into data.table, found 6 total tokens
   ... stemming the tokens (english)
   ... ignoring 174 feature types, discarding 1 total features (16.7%)
   ... summing tokens by document
   ... indexing 5 feature types
   ... building sparse matrix
   ... created a 2 x 5 sparse dfm
   ... complete. Elapsed time: 0.016 seconds.
Document-feature matrix of: 2 documents, 5 features.
2 x 5 sparse Matrix of class "dfmSparse"
       features
docs    air enjoy quot singl time
  text1   0     1    0     0    1
  text2   1     0    1     1    0
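
The tokenize() and toLower() calls above are from an older quanteda release. A rough equivalent for current quanteda versions, sketched with the newer tokens()/dfm() API (an approximation, not part of the original answer):

library(quanteda)

text <- c("Enjoying \"my time\".", "Single 'air quotes'.")
toks <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)  # drop punctuation and numbers
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))                      # remove English stopwords
toks <- tokens_wordstem(toks)                                     # stem
dfm(toks)                                                         # document-feature matrix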



Answer 3:


The answer by @cyberj0g needs a small modification for the latest version of tm (0.6): the custom function has to be wrapped in content_transformer(). The updated code can be written as follows:

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)
corpus <- tm_map(corpus, content_transformer(removeSpecialChars))

Thanks to @cyberj0g for the working code.
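
For completeness, a minimal reproducible sketch (the two sample strings and the in-memory VectorSource corpus are mine, not from the question) showing the typographic quotes and bullets being stripped:

library(tm)

# two short documents containing curly quotes and bullets
docs <- VCorpus(VectorSource(c("“We” follow the company’s “code guidelines”.",
                               "• well-known • accelerated")))

removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, stripWhitespace)

inspect(DocumentTermMatrix(docs))  # terms no longer carry quotes or bullets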



Source: https://stackoverflow.com/questions/30994194/quotes-and-hyphens-not-removed-by-tm-package-functions-while-cleaning-corpus
