Removing rows from Corpus with multiple documents

被刻印的时光 ゝ posted on 2019-12-05 21:42:49

No need for a for loop, although it has long been a frustrating feature of tm that the texts are hard to access once they are in a corpus object.

I've interpreted what you mean by "row" as a document - so the example above is two "rows". If this is not the case, this answer needs to be (but can easily be) adjusted.

Try this:

txt <- c("Quick to deploy, easy to use, and offering complete investment
protection,   our product is clearly differentiated from all
competitive offerings by its common, modular platform, seamless
integration, broad range of support to heterogeneous products from
Microsoft,Apple, Oracle and unequalled scalability, support for
industry standards, and business application-to-storage system
correlation capabilities.",
"Microsoft is U.S. registered trademarks of Microsoft Corporation, Oracle is a U.S. registered trademarks of Oracle Corporation and Apple
is a U.S. registered trademarks of Apple Corporation.")

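library(tm)
corp <- VCorpus(VectorSource(txt))
# Pull each document's text out of the corpus as a plain character vector
textVector <- sapply(corp, as.character)
# Keep only the documents that do NOT mention "trademark"; unlike
# -grep(), !grepl() still keeps every document when nothing matches
keep <- !grepl("trademark", textVector, ignore.case = TRUE)
newCorp <- VCorpus(VectorSource(textVector[keep]))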

newCorp now excludes documents containing "trademark". Note that the pattern is matched as a substring, so plurals (e.g. "trademarks") are excluded too.
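
One thing worth checking when filtering by pattern like this is what happens if no document matches at all. The base R sketch below uses made-up strings (`docs` is a hypothetical stand-in, not the corpus above) to compare negative indexing with grep() against logical indexing with grepl():

```r
# Hypothetical toy data: none of these strings contain "trademark"
docs <- c("alpha beta", "gamma delta", "epsilon")

# grep() returns integer(0) when nothing matches, and x[-integer(0)]
# is the same as x[integer(0)], which selects nothing at all
viaGrep <- docs[-grep("trademark", docs, ignore.case = TRUE)]

# grepl() returns one TRUE/FALSE per element, so negating it keeps
# every non-matching element even when there are no matches
viaGrepl <- docs[!grepl("trademark", docs, ignore.case = TRUE)]

length(viaGrep)   # 0: every document was dropped
length(viaGrepl)  # 3: every document was kept
```

So the logical form is the safer choice whenever the pattern might not occur in the data.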

Thank you, Ken. Below is the small modification I made to get it working.

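    library(tm)
    corp <- VCorpus(VectorSource(txt))
    textVector <- sapply(corp, as.character)
    # Copy once, before the loop, so earlier iterations are not overwritten
    newCorp <- textVector
    for (j in seq_along(textVector)) {
      # Drop the lines of document j that mention "trademark(s)"
      drop <- grepl("trademarks|trademark", textVector[[j]], ignore.case = TRUE)
      newCorp[[j]] <- textVector[[j]][!drop]
    }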

It seems 'textVector' contains a 'list' of documents, so a 'for' loop is still needed.
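
If the documents really do come back as a list, with one character vector of lines per document, the same per-document filtering can probably be written without an explicit loop. A minimal sketch, assuming that list shape; `textList` and `dropTrademarkLines` are hypothetical names made up for illustration:

```r
# Hypothetical stand-in for 'textVector': a list with one character
# vector of lines per document
textList <- list(
  doc1 = c("Quick to deploy and easy to use.",
           "Microsoft is a registered trademark of Microsoft Corporation."),
  doc2 = c("Seamless integration.", "Broad range of support.")
)

# Drop the lines that mention "trademark" (this pattern also matches
# "trademarks") from a single document
dropTrademarkLines <- function(lines) {
  lines[!grepl("trademark", lines, ignore.case = TRUE)]
}

# lapply() visits every document and returns a list of the same shape
cleaned <- lapply(textList, dropTrademarkLines)

lengths(cleaned)
```

Here lapply() does the iteration, so nothing has to be copied or assigned by index, and a document with no matching lines comes through unchanged.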