Question
Link to the input PDFs:
https://drive.google.com/drive/folders/1dcgDpfiVjMTGmYSRGnQA65YjZzv0AwXL?usp=sharing
The code goes through all the PDF files in the working directory, builds a corpus, and splits the text of each file into lines. It then checks the lines against the given search list, but it only reports whether each search term is present in the PDF or not (a <- sapply(unlist(Table_search), grepl, x = tablelines)).
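library(tm)        # Corpus, URISource, readPDF, tm_map, DublinCore
library(pdftools)  # pdf_text
library(stringr)   # str_split, %>%

setwd("D:")
# All PDF files in the working directory
tables <- list.files(pattern = "pdf$")
tablecorpus <- Corpus(URISource(tables),
                      readerControl = list(reader = readPDF))
# Replace carriage returns with spaces
tospace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tablecorpus <- tm_map(tablecorpus, tospace, "\r")
Table_Filenames <- DublinCore(tablecorpus, "id")
# One character vector of lines per file (first page only, because of [[1]])
tablelines <- lapply(tables, function(x) strsplit(pdf_text(x), "\n")[[1]])
tablelist <- unlist(tablelines) %>% str_split("\n")
Table_search <- list("Table 14", "Source Data:", "VERSION")
# Returns TRUE/FALSE flags, not the matching lines
a <- sapply(unlist(Table_search), grepl, x = tablelines)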
I want the code to print the actual line wherever it finds the keyword in the PDF file, as shown in image 2.
Answer 1:
You can use grep to get the index where you find the text.
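As a small illustration of the difference, here is a sketch on a made-up vector of lines (`page_lines` and its contents are hypothetical, not taken from the PDFs). `grepl` only says whether a match exists, while `grep` gives the positions, which can then be used to pull out the lines themselves:

```r
# Made-up lines standing in for the text of one PDF (hypothetical data).
page_lines <- c("Table 14.1.1.1 (Page 1 of 2)",
                "Some unrelated text",
                "Source Data: Listing 16.2.1.1.1")

# grepl() returns one TRUE/FALSE per element: is there a match?
hit_flags <- grepl("Table 14", page_lines)

# grep() returns the positions of the matching elements instead.
hit_index <- grep("Table 14", page_lines)

# Index back into the vector to recover the matching line itself.
hit_lines <- page_lines[hit_index]

print(hit_flags)
print(hit_index)
print(hit_lines)
```

Note that the pattern is treated as a regular expression by default; passing `fixed = TRUE` makes it a literal string. Applied to every keyword and every file, the same idea looks like this: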
Table_search <- c("Table 14", "Source Data:", "VERSION")
a <- sapply(Table_search, function(x) {
  sapply(tablelines, function(y) {
    inds <- grep(x, y)
    if (length(inds) > 0)
      toString(y[inds])
    else NA
  })
})
a
#     Table 14                       Source Data:
#[1,] "Table 14.1.1.1 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.1"
#[2,] "Table 14.1.1.2 (Page 1 of 2)" "Source Data: Listing 16.2.1.1.2"
#     VERSION
#[1,] NA
#[2,] "Summary of Subject Status by Ambulatory/Non-Ambulatory at Study Entry, VERSION -2"
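The source URL suggests the asker also wanted "Not Available" printed when a keyword is not found, rather than NA. A minimal sketch of that variant follows. It assumes `tablelines` is a list with one character vector of lines per PDF; the data below is made up so the example runs on its own, and `find_lines` is an illustrative helper, not part of any package.

```r
# Hypothetical stand-in for `tablelines`: one character vector of lines per PDF.
tablelines <- list(
  c("Table 14.1.1.1 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.1"),
  c("Table 14.1.1.2 (Page 1 of 2)", "Source Data: Listing 16.2.1.1.2",
    "Summary of Subject Status, VERSION -2")
)
Table_search <- c("Table 14", "Source Data:", "VERSION")

# Illustrative helper: return the matching lines joined by ", ",
# or a placeholder string when nothing matches.
find_lines <- function(keyword, lines, missing = "Not Available") {
  # fixed = TRUE treats the keyword literally; value = TRUE returns the lines.
  hits <- grep(keyword, lines, fixed = TRUE, value = TRUE)
  if (length(hits) > 0) toString(hits) else missing
}

# Rows are files, columns are keywords.
result <- sapply(Table_search, function(keyword) {
  vapply(tablelines, find_lines, character(1), keyword = keyword)
})
print(result)
```

Using `vapply` with `character(1)` guards against a file returning something other than a single string, which keeps the result a plain character matrix.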
Source: https://stackoverflow.com/questions/66096407/search-pdfs-extract-lines-with-keyword-and-print-not-available-if-keyword-not-f