How to set the author for each document in a tm corpus by parsing the document ID

Posted by 懵懂的女人 on 2019-12-13 04:24:21

Question


I have a tm Corpus object like this:

> summary(corp.eng)
A corpus with 154 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID

The metadata for each document in the corpus looks like this:

> meta(corp.eng[[1]])
Available meta data pairs are:
  Author       : 
  DateTimeStamp: 2013-04-18 14:37:24
  Description  : 
  Heading      : 
  ID           : Smith-John_e.txt
  Language     : en_CA
  Origin       :

I know that I can set the Author of one document at a time with this:

meta(corp.eng[[1]],tag="Author") <-  
  paste(
    rev(
      unlist(
        strsplit(meta(corp.eng[[1]],tag="ID"), c("[-_]"))
      )[1:2]
    ), collapse=' ')

which gives me a result like this:

> meta(corp.eng[[1]],tag="Author")
[1] "John Smith" 

How do I batch the job?


Answer 1:


NOTE: This probably belongs in a comment rather than an answer, but since part of it works, here is an example using the crude dataset that ships with tm:

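library(tm)
data(crude)
# Read the document-level "Places" tag from every document
extracted.values <- meta(crude,tag="Places",type="local")
for (i in seq_along(extracted.values)) {
     # Overwrite the tag with a value derived from the old one
     meta(crude[[i]],tag="Places") <- substr(extracted.values[[i]],1,3)
}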

It should be possible to do this with lapply as well, but since I am not familiar with the inner workings of tm, I'll stick with a loop. Replace substr with whatever function you need, and change the tag and corpus on the left-hand side to match your data. Hope this helps.
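
Applied to the question, the same loop might look roughly like the sketch below. The helper name author_from_id and the second sample ID are made up for illustration; the parsing is the strsplit/rev/paste expression from the question, and it assumes every ID starts with "Lastname-Firstname" followed by "_" or "-". The tag names "ID" and "Author" are the ones shown in the question; newer tm releases appear to use lowercase tags such as "id" and "author", so adjust as needed.

```r
# Hypothetical helper: turn an ID such as "Smith-John_e.txt" into "John Smith".
# Assumes the ID begins with "Lastname-Firstname" followed by "_" or "-".
author_from_id <- function(id) {
  parts <- strsplit(id, "[-_]")[[1]]      # e.g. "Smith", "John", "e.txt"
  paste(rev(parts[1:2]), collapse = " ")  # swap to "Firstname Lastname"
}

# Check on plain strings first (the second ID is invented for illustration).
sample.ids <- c("Smith-John_e.txt", "Doe-Jane_f.txt")
sample.authors <- vapply(sample.ids, author_from_id, character(1),
                         USE.NAMES = FALSE)
print(sample.authors)

# Batch update; runs only if the corpus from the question is present.
if (exists("corp.eng")) {
  library(tm)
  for (i in seq_along(corp.eng)) {
    # Tag names as in the question; newer tm versions may need "id"/"author".
    meta(corp.eng[[i]], tag = "Author") <-
      author_from_id(meta(corp.eng[[i]], tag = "ID"))
  }
}
```

Trying the helper on a few IDs before touching the corpus makes it easy to spot file names that do not follow the expected pattern; an ID with fewer than two parts would produce an "NA" in the author string.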



Source: https://stackoverflow.com/questions/16090001/how-to-set-author-for-each-doc-in-a-corpus-by-parsing-doc-id
