Question
I am creating a Corpus from a dataframe. I pass it as a VectorSource, as there is only one column I want to use as the text source. This works fine, but I need the document ids within the corpus to match the document ids from the dataframe. The document ids are stored in a separate column of the original dataframe.
df <- as.data.frame(t(rbind(c(1,3,5,7,8,10),
c("text", "lots of text", "too much text", "where will it end", "give peas a chance","help"))))
colnames(df) <- c("ids","textColumn")
library("tm")
library("lsa")
corpus <- Corpus(VectorSource(df[["textColumn"]]))
Running this code creates a corpus, but the document ids run from 1 to 6. Is there any way to create the corpus with the document ids 1, 3, 5, 7, 8, 10?
Answer 1:
Well, one simple but not very elegant way is to assign your ids to the documents afterwards:
for (i in seq_along(corpus)) {
  attr(corpus[[i]], "ID") <- df$ids[i]
}
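The loop above writes to an "ID" attribute. As far as I know, from tm 0.6 onward the id is stored in each document's metadata instead, so the same idea may need to go through meta(). The following is only a sketch under that assumption; the rebuilt df and the doc_ids helper variable are there for illustration and are not part of the original answer.

```r
library(tm)

# Same example data as in the question, built directly as a data frame
df <- data.frame(
  ids = c(1, 3, 5, 7, 8, 10),
  textColumn = c("text", "lots of text", "too much text",
                 "where will it end", "give peas a chance", "help"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(df$textColumn))

# Assumption: tm >= 0.6, where the id is a metadata tag rather than an "ID" attribute
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "id") <- as.character(df$ids[i])
}

# Collect the ids back out of the documents to check the assignment
doc_ids <- vapply(corpus, function(d) meta(d, "id"), character(1))
```

The ids are converted to character because tm treats document ids as strings.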
Answer 2:
I know it's probably late for @user1098798, but you can specify the ids directly when creating the corpus. You need to load the data with DataframeSource() and add a mapping for the columns:
corpus = VCorpus(DataframeSource(df), readerControl = list(reader = readTabular(mapping = list(content = "textColumn", id = "ids"))))
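Depending on the installed tm version, readTabular() may not be available. To my knowledge, releases from 0.7 on expect the data frame passed to DataframeSource() to have a first column named doc_id and a second column named text, and they take the ids from doc_id. A minimal sketch under that assumption (the docs data frame is a name introduced here for illustration):

```r
library(tm)

# Same example data as in the question
df <- data.frame(
  ids = c(1, 3, 5, 7, 8, 10),
  textColumn = c("text", "lots of text", "too much text",
                 "where will it end", "give peas a chance", "help"),
  stringsAsFactors = FALSE
)

# Assumption: tm >= 0.7, where DataframeSource() requires the columns
# "doc_id" (first) and "text" (second)
docs <- data.frame(
  doc_id = as.character(df$ids),
  text = df$textColumn,
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))

# The document names should now mirror the ids column
names(corpus)
```

Any further columns in docs would be kept as document-level metadata, so extra dataframe columns can travel with the corpus this way.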
Answer 3:
Here is a qdap approach to this problem that can handle it without the loop:
Use qdap version >= 1.1.0 right from the get-go to convert the dataframe to a Corpus, and the ID tags will be added automatically.
with(df, as.Corpus(textColumn, ids))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 3
## Content: documents: 6
## Look around a bit
meta(with(df, as.Corpus(textColumn, ids)), tag="id")
inspect(with(df, as.Corpus(textColumn, ids)))
Source: https://stackoverflow.com/questions/14852357/how-can-i-manually-set-the-document-id-in-a-corpus