tm loses the metadata when applying tm_map

回眸只為那壹抹淺笑 posted on 2019-12-05 02:24:35

Question


I have a (small) problem with the tm R package. Say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "1" "2" "3" "4" "5"

This works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

This gives:

Error: inherits(doc, "TextDocument") is not TRUE

The solution proposed in this case was to transform to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"

Now it works, but it erases all the metadata (in this case the document names). Is there a way to keep the metadata, or to save it and restore it afterwards?


Answer 1:


I found it.

The line:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

solves the problem but erases the metadata.
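
As a rough illustration of why the names vanish (a sketch, assuming a tm version where PlainTextDocument() leaves the id empty unless one is passed): after the tolower step each document is no longer a TextDocument, here assumed to be a bare character vector, so PlainTextDocument() builds a brand-new document around it with default metadata. The variable names below are made up for the example.

```r
library(tm)

# Assumed state of the first document after tm_map(myCorpus, tolower):
# only the text is left, the original document object is gone.
stripped <- tolower("one")

# Wrapping it creates a new document; nothing carries over from the old one.
rebuilt <- PlainTextDocument(stripped)
empty_id <- meta(rebuilt, "id")   # no id was supplied, so this is empty

# Passing the id explicitly is one way to restore it by hand.
restored <- PlainTextDocument(stripped, id = "1")
restored_id <- meta(restored, "id")
```

Restoring ids this way would have to be repeated for every document, which is why a transformation that never discards the document in the first place is the more convenient route.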

I found an answer that explains a better way to use tm_map(). I just have to replace:

myCorpus <- tm_map(myCorpus, tolower)

with:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

And everything works!
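
For reference, here is a minimal end-to-end sketch of that fix. It assumes a tm version that provides VCorpus() (used here instead of Corpus() so that every document is a full TextDocument), and lower_keep_meta is a hypothetical hand-written wrapper, not part of tm, that mimics what content_transformer() does: change only the content and hand back the same document.

```r
library(tm)

bcorp <- c("One", "Two", "Three", "Four", "Five")
myCorpus <- VCorpus(VectorSource(bcorp))

# content_transformer() wraps tolower so that it rewrites content(doc)
# and returns the document itself, metadata included.
lowered <- tm_map(myCorpus, content_transformer(tolower))
tdm <- TermDocumentMatrix(lowered)
doc_names <- Docs(tdm)

# Hypothetical hand-written equivalent of the wrapper (illustration only).
lower_keep_meta <- function(doc) {
  content(doc) <- tolower(content(doc))
  doc
}
lowered_by_hand <- tm_map(myCorpus, lower_keep_meta)
```

With either variant, Docs() on the resulting matrix should list the original document ids rather than character(0), because the documents are modified rather than replaced.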



Source: https://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
