I have nearly one thousand pdf journal articles in a folder. I need to text mine the abstracts of all the articles in that folder. Now I am doing the following:
We can use the pdftools library:
library(pdftools)
# you can use a url or a path
pdf_url <- "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf"
# `pdf_text` returns one element per page
list_output <- pdftools::pdf_text(pdf_url)
length(list_output) # 5 elements for a 5-page pdf
# let's print the 5th
cat(list_output[[5]])
# Index
# pdf_attachments (pdf_info), 2
# pdf_convert (pdf_render_page), 3
# pdf_fonts (pdf_info), 2
# pdf_info, 2, 3
# pdf_render_page, 2, 3
# pdf_text, 2
# pdf_text (pdf_info), 2
# pdf_toc (pdf_info), 2
# pdftools (pdf_info), 2
# poppler_config (pdf_render_page), 3
# render (pdf_render_page), 3
# suppressMessages, 2
# 5
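Since the OP's files are local rather than online, here is a minimal sketch of reading every pdf in the folder (the "articles" path is an assumption, replace it with the actual folder):

# list all pdf files in the folder ("articles" is a placeholder path)
pdf_files <- list.files("articles", pattern = "\\.pdf$", full.names = TRUE)
# one character vector per file, one element per page
pages_by_file <- lapply(pdf_files, pdftools::pdf_text)
names(pages_by_file) <- basename(pdf_files)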
To extract abstracts from the articles, the OP chooses to take the content between Abstract and Introduction.
We'll take a list of CRAN pdfs and extract the author(s) as the text between Author and Maintainer (I handpicked a few that had a compatible format).
For this we loop over our url list, extract the content, collapse all pages into one string per pdf, and then pull out the relevant part with a regex.
urls <- c(pdftools = "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf",
          Rcpp = "https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf",
          jpeg = "https://cran.r-project.org/web/packages/jpeg/jpeg.pdf")
lapply(urls, function(url){
  # read all pages, collapse them into a single string, and normalize whitespace
  list_output <- pdftools::pdf_text(url)
  text_output <- gsub('(\\s|\r|\n)+', ' ', paste(unlist(list_output), collapse = " "))
  # keep the first match of the text between "Author" and "Maintainer"
  trimws(regmatches(text_output, gregexpr("(?<=Author).*?(?=Maintainer)", text_output, perl = TRUE))[[1]][1])
})
# $pdftools
# [1] "Jeroen Ooms"
#
# $Rcpp
# [1] "Dirk Eddelbuettel, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates and John Chambers"
#
# $jpeg
# [1] "Simon Urbanek "