Removing rows from Corpus with multiple documents

Question

I have 4000 text documents in a corpus. As part of data cleanup, I want to remove from each document any row(s) that contain a specific word.

For example:

library(tm)
doc.corpus<-  VCorpus(DirSource("C:\\TextMining\\Prototype",pattern="*.txt",encoding= "UTF8",mode = "text"),readerControl=list(language="en"))

doc.corpus<- tm_map(doc.corpus, PlainTextDocument)

doc.corpus[[1]]

#PlainTextDocument
Metadata:  7
Content:  chars: 16542

    as.character(doc.corpus)[[1]]


$content


"Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities."
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation."

My problem is to remove the second row, which contains the word "trademark", from this and all other documents. I used grepl() to identify the rows and tried to exclude them with the indexing approach typically used on a data frame, which did not work:

corpus.copy <- doc.corpus
doc.corpus[[1]] <- corpus.copy[[1]][!grepl("trademark", as.character(corpus.copy[[1]]), ignore.case = TRUE), ]

Once this works for the first document, I can easily use a for loop to apply it to all documents in the corpus.

Any hints or solutions are appreciated. I know I could take the alternative route of converting the corpus to a data frame, removing the undesirable rows, and converting back to a corpus. Thanks.

System.info:
[1] "x86_64-w64-mingw32"; 
[1] "R version 3.1.0 (2014-04-10)"
[1] tm_0.6-2

Answer 1:

No need for a for loop - although it's long been a frustrating feature of tm that it's hard to access the texts once they are in a corpus object.

I've interpreted what you mean by "row" as a document - so the example above is two "rows". If this is not the case, this answer needs to be (but can easily be) adjusted.

Try this:

txt <- c("Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities.",
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation.")

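library(tm)
corp <- VCorpus(VectorSource(txt))
# One string per document
textVector <- sapply(corp, as.character)
# Logical indexing is safe even when nothing matches; x[-grep(...)] would
# drop every document in that case, because grep() returns integer(0).
keep <- !grepl("trademark", textVector, ignore.case = TRUE)
newCorp <- VCorpus(VectorSource(textVector[keep]))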

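newCorp now excludes documents containing "trademark". Note that this pattern also matches plurals (e.g. "trademarks"), because it is matched as a substring; if you need only the singular form, add word boundaries to the pattern (e.g. "\\btrademark\\b").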
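
If a "row" means a line inside a document rather than a whole document, the same filter can be applied to each document's content instead. The following is only a minimal sketch under that assumption: the toy corpus `docs` and the helper name `drop_lines` are made up for illustration, and it relies on `content_transformer()`, which is available in tm 0.6 and later.

```r
library(tm)

# Hypothetical toy corpus: each document is a character vector of lines.
docs <- list(
  c("Quick to deploy and easy to use.",
    "Microsoft is a U.S. registered trademark of Microsoft Corporation."),
  c("Seamless integration.", "Unequalled scalability.")
)
corp <- VCorpus(VectorSource(docs))

# drop_lines is an illustrative helper, not part of tm. It keeps only the
# lines of a document that do not match the pattern.
drop_lines <- content_transformer(function(x) {
  x[!grepl("trademark", x, ignore.case = TRUE)]
})

# tm_map() applies the transformation to every document in the corpus.
cleaned <- tm_map(corp, drop_lines)

# Inspect the remaining lines of the first document.
content(cleaned[[1]])
```

Because the filter uses logical indexing, documents with no matching line pass through unchanged, and the result is still a corpus, so no conversion to a data frame is needed.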

Answer 2:

Thank you, Ken. Below is the small modification I made for my successful implementation.

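    library(tm)
    corp <- VCorpus(VectorSource(txt))
    # lapply() always returns a list: one character vector of lines per document
    textVector <- lapply(corp, as.character)
    # Copy once, before the loop, so earlier iterations are not overwritten
    newCorp <- textVector
    for (j in seq_along(textVector)) {
      # grepl() rather than -grep(): documents without a match are left intact
      keep <- !grepl("trademarks|trademark", textVector[[j]], ignore.case = TRUE)
      newCorp[[j]] <- textVector[[j]][keep]
    }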

In my corpus, 'textVector' is a list with one character vector of lines per document, so the filter has to be applied to each document in turn; a 'for' loop is still needed.

Source: https://stackoverflow.com/questions/34646291/removing-rows-from-corpus-with-multiple-documents
