Search PDFs, extract lines containing a keyword, and print "Not available" if the keyword is not found

Submitted by 本小妞迷上赌 on 2021-02-17 05:23:28

Question


Link to the input PDFs:

https://drive.google.com/drive/folders/1dcgDpfiVjMTGmYSRGnQA65YjZzv0AwXL?usp=sharing

The code goes through all the PDF files in the working directory, builds a corpus, and splits the text of each file into lines. It then checks those lines against a list of search terms and reports whether each term is present in each PDF (a <- sapply(unlist(Table_search), grepl, x = tablelines)).

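library(tm)        # Corpus, URISource, readPDF, tm_map, content_transformer
library(pdftools)  # pdf_text
library(stringr)   # str_split and %>%

setwd("D:")
tables <- list.files(pattern = "pdf$")
tablecorpus <- Corpus(URISource(tables),
                      readerControl = list(reader = readPDF))

# Replace carriage returns with spaces
tospace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tablecorpus <- tm_map(tablecorpus, tospace, "\r")
Table_Filenames <- DublinCore(tablecorpus, "id")

# One character vector of lines per PDF; [[1]] keeps only the first page
tablelines <- lapply(tables, function(x) strsplit(pdf_text(x), "\n")[[1]])
tablelist <- unlist(tablelines) %>% str_split("\n")
Table_search <- list("Table 14", "Source Data:", "VERSION")

# One TRUE/FALSE per keyword and PDF, but not the matching line itself
a <- sapply(unlist(Table_search), grepl, x = tablelines)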

I want the code to print the actual line wherever it finds the keyword in the PDF file, as shown in image 2.
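
The gap between the current result and the desired one comes down to which function is called: grepl only reports whether each element matches, while grep returns the positions of the matches, or the matching elements themselves with value = TRUE. A minimal sketch on made-up lines (page_lines is a hypothetical stand-in for the text of one PDF, and fixed = TRUE is an assumption that the keywords are literal strings rather than regular expressions):

```r
# Made-up lines standing in for the text of one PDF (hypothetical data)
page_lines <- c("Table 14.1.1.1 (Page 1 of 2)",
                "Some other text",
                "Source Data: Listing 16.2.1.1.1")

# grepl: one TRUE/FALSE per line
present <- grepl("Table 14", page_lines, fixed = TRUE)

# grep: positions of the matching lines
idx <- grep("Table 14", page_lines, fixed = TRUE)

# grep with value = TRUE: the matching lines themselves
matched <- grep("Table 14", page_lines, fixed = TRUE, value = TRUE)

print(present)
print(idx)
print(matched)
```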


Answer 1:


You can use grep to get the indices of the lines that contain the text, and then return those lines.

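Table_search <- c("Table 14", "Source Data:", "VERSION")

# For each keyword (columns) and each PDF (rows), collect the matching lines
sapply(Table_search, function(x) {
  sapply(tablelines, function(y) {
    inds <- grep(x, y)          # indices of the lines in y that match x
    if (length(inds) > 0)
      toString(y[inds])         # join multiple matching lines with ", "
    else NA                     # keyword not found in this PDF
  })
}) -> a
a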

#                          Table 14                       Source Data:                     
#[1,] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
#[2,] "Table 14.1.1.2 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.2"

#     VERSION                                                                            
#[1,] NA                                                                                 
#[2,] "Summary of Subject Status by Ambulatory/Non-Ambulatory at Study Entry, VERSION -2"
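
The question title asks for "Not available" rather than NA when a keyword is missing. One way this could be done is sketched below; it is not part of the original answer. The helper name find_lines, its missing argument, and the in-memory tablelines are hypothetical stand-ins so the example runs without the PDFs. It also uses fixed = TRUE, on the assumption that the keywords are literal strings, whereas the answer above treats them as regular expressions.

```r
# Hypothetical stand-in for the per-PDF lines that pdf_text() would produce
tablelines <- list(
  c("Table 14.1.1.1 (Page 1 of 2)",
    "Source Data: Listing 16.2.1.1.1"),
  c("Table 14.1.1.2 (Page 1 of 2)",
    "Summary of Subject Status by Ambulatory/Non-Ambulatory at Study Entry, VERSION -2",
    "Source Data: Listing 16.2.1.1.2")
)
Table_search <- c("Table 14", "Source Data:", "VERSION")

# find_lines() is a hypothetical helper, not a function from any package.
# It returns the matching lines per keyword and file, or `missing` if none match.
find_lines <- function(keywords, lines_by_file, missing = "Not available") {
  sapply(keywords, function(kw) {
    vapply(lines_by_file, function(file_lines) {
      # value = TRUE returns the matching lines instead of their indices
      hits <- grep(kw, file_lines, fixed = TRUE, value = TRUE)
      if (length(hits) > 0) toString(hits) else missing
    }, character(1))
  })
}

a <- find_lines(Table_search, tablelines)
print(a)
```

With two or more files the result is a character matrix with one row per file and one column per keyword. With a single file, sapply simplifies it to a named character vector instead.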


Source: https://stackoverflow.com/questions/66096407/search-pdfs-extract-lines-with-keyword-and-print-not-available-if-keyword-not-f
