Extracting text data from PDF files

[愿得一人] 2020-12-02 11:24

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?

7 answers
  •  难免孤独
    2020-12-02 11:42

    I used an external utility to do the conversion and called it from R. All of my files had a leading table containing the desired information.

    Set the path to pdftotext.exe and convert each PDF to text:

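    library(stringr)  # provides str_sub()

    # Path to the Xpdf pdftotext executable
    exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

    # pdfFracList (PDF file names) and reportDir (their folder) are defined elsewhere
    for (i in seq_along(pdfFracList)) {
        fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)  # drop ".pdf"
        pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
        txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
        print(paste0("File number ", i, ", Processing file ", pdfSource))
        # shQuote() protects paths that contain spaces
        system(paste(exeFile, "-table", shQuote(pdfSource), shQuote(txtDestination)), wait = TRUE)
    }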
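
    The same idea can be sketched with base R only, using system2() instead of pasting a command string together. This is an illustrative variant, not the code I actually ran: the helper names build_pdftotext_args and convert_pdfs are hypothetical, and it assumes pdftotext can be found on the PATH (otherwise it converts nothing).

    ```r
    # Hypothetical helper: build the argument vector for one PDF.
    build_pdftotext_args <- function(pdf_path) {
        # Swap the .pdf extension for .txt to get the output path
        txt_path <- sub("\\.pdf$", ".txt", pdf_path, ignore.case = TRUE)
        c("-table", shQuote(pdf_path), shQuote(txt_path))
    }

    # Hypothetical helper: convert every PDF in a folder.
    # Assumes pdftotext is on the PATH unless `exe` is given explicitly.
    convert_pdfs <- function(pdf_dir, exe = unname(Sys.which("pdftotext"))) {
        pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                                full.names = TRUE, ignore.case = TRUE)
        if (!nzchar(exe)) {
            message("pdftotext not found; nothing converted")
            return(invisible(integer(0)))
        }
        # system2() returns the exit status of each call (0 means success)
        status <- vapply(pdf_files,
                         function(f) system2(exe, build_pdftotext_args(f)),
                         integer(1))
        invisible(status)
    }
    ```

    Calling convert_pdfs(reportDir) would then write a .txt file next to each PDF. Passing the arguments as a vector to system2() avoids assembling the command line by hand.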
    
