How can I manually set the document id in a corpus?

问题

I am creating a Copus from a dataframe. I pass it as a VectorSource as there is only one column I want to be used as the text source. This works find however I need the document ids within the corpus to match the document ids from the dataframe. The document ids are stored in a separate column in the original dataframe.

df <- as.data.frame(t(rbind(c(1,3,5,7,8,10), 
                        c("text", "lots of text", "too much text", "where will it end",         "give peas a chance","help"))))
colnames(df) <- c("ids","textColumn")
library("tm")
library("lsa")
corpus <- Corpus(VectorSource(df[["textColumn"]]))

Running this code creates a corpus however the document ids run from 1-6. Is there any way of creating the corpus with the document ids 1,3,5,7,8,10?

回答1:

Well, one simple but not very elegant way to assign your ids to your documents afterward could be the following :

for (i in 1:length(corpus)) {
   attr(corpus[[i]], "ID") <- df$ids[i]
}

回答2:

I know it's probably late for @user1098798, but there is a way how you can specify ids directly when creating the corpus. You need to load the data as DataframeSource() and add mapping to the columns:

corpus = VCorpus(DataframeSource(df), readerControl = list(reader = readTabular(mapping = list(content = "textColumn", id = "ids"))))

回答3:

Here is a qdap approach to this problem that can handle it without the loop:

Use qdap version >= 1.1.0 right from the get go to convert the dataframe to a Corpus and the ID tags will be automatically added.

with(df, as.Corpus(textColumn, ids))

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 3
## Content:  documents: 6


## Look around a bit
meta(with(df, as.Corpus(textColumn, ids)), tag="id")
inspect(with(df, as.Corpus(textColumn, ids)))

来源：https://stackoverflow.com/questions/14852357/how-can-i-manually-set-the-document-id-in-a-corpus

标签