Read multiple *.rtf files in R

Posted by 二次信任 on 2021-02-08 06:38:37

Question


I have a folder with more than 2,000 RTF documents. I want to import them into R, preferably into a data frame that can be used with the tidytext package. I also need an extra column holding the filename, so that I can link the content of each RTF document to its file (later, I will also have to extract information from the filename and save it into separate columns of my data set).

I came across a solution by Jens Leerssen that I tried to adapt to my requirements:

require(textreadr)

read_plus <- function(flnm) {
    read_rtf(flnm) %>%
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(path = "./data", pattern = "*.rtf",
               full.names = TRUE) %>%
    map_df(~read_plus(.))

However, I get the following error message:

Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "character"

Can anyone tell me why this error occurs or propose another solution to my problem?


Answer 1:


I finally solved the problem with a workaround.

1) I converted the *.rtf files to *.txt files using the textutil command in the macOS Terminal:

find . -name \*.rtf -print0 | xargs -0 textutil -convert txt

Doing so also strips the formatting.

2) I then used the read_plus function by Jens Leerssen. However, I now use read.delim instead of read_rtf and set two options (stringsAsFactors and quote) to get rid of warnings and errors:

library(dplyr)  # %>%, mutate(), rename()
library(purrr)  # map_df()

read_plus <- function(flnm) {
    read.delim(flnm, header = FALSE, stringsAsFactors = FALSE, quote = "") %>%
        mutate(filename = flnm)
}

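3) Finally, I read in all the *.txt files and renamed the column V1 at the end. (The pattern argument of list.files is a regular expression, not a glob, so it is written as "\\.txt$" here.)

df <- list.files(path = "./data", pattern = "\\.txt$",
                 full.names = TRUE) %>%
    map_df(~read_plus(.)) %>%
    rename(paragraph = V1)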
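
As for why the original read_plus failed: the error message says mutate was applied to an object of class "character", which suggests that read_rtf returned a plain character vector rather than a data frame. The following is a minimal base-R sketch of the general idea, not the approach used above: wrap the character vector in a data frame first, then attach the filename. The helper names lines_to_df and read_txt_plus, the doc_id column, and the demo files are all made up for illustration. readLines stands in for whatever reader returns the text, and swapping in read_rtf at that spot should presumably work the same way, though that is untested here.

```r
# Minimal sketch in base R: wrap a character vector of paragraphs in a data
# frame so that a filename column can be attached to it.
# `lines_to_df` and `read_txt_plus` are hypothetical helper names.
lines_to_df <- function(lines, flnm) {
  data.frame(
    paragraph = lines,
    filename  = rep(flnm, length(lines)),  # also works for zero-line input
    stringsAsFactors = FALSE
  )
}

read_txt_plus <- function(flnm) {
  # readLines() returns a character vector, one element per line.
  lines_to_df(readLines(flnm, warn = FALSE), flnm)
}

# Demo on two throwaway files in a temporary directory (made-up content).
demo_dir <- file.path(tempdir(), "rtf_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first paragraph", "second paragraph"),
           file.path(demo_dir, "a.txt"))
writeLines("only paragraph", file.path(demo_dir, "b.txt"))

# `pattern` is a regular expression, hence "\\.txt$" rather than "*.txt".
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
df <- do.call(rbind, lapply(files, read_txt_plus))

# File name without directory or extension, as a starting point for
# splitting information from the filename into separate columns later.
df$doc_id <- tools::file_path_sans_ext(basename(df$filename))
print(df)
```

The doc_id column is only a starting point. How to split it further depends on the naming scheme of the files, which is not given in the question.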


Source: https://stackoverflow.com/questions/50002129/read-multiple-rtf-files-in-r
