How to scrape a downloaded PDF file with R

Submitted by 倖福魔咒の on 2020-03-05 03:55:14

Question


I’ve recently gotten into scraping (and programming in general) for my internship, and I’ve run into PDF scraping. Whenever I try to read a scanned PDF with R, I can’t get it to work. I’ve tried the file.choose() function, to no avail. Do I need to change my working directory, or is there another way to get the PDF from my files into R? My code looks something like this:

    > library(pdftools)
    > text=pdf_text("C:/Users/myname/Documents/renewalscan.pdf")
    > text
    [1] ""

Using pdftables instead gives me this error:

    > library(pdftables)
    > convert_pdf("C:/Users/myname/Documents/renewalscan.pdf","my.csv")
    Error in get_content(input_file, format, api_key) : 
    Bad Request (HTTP 400).

Answer 1:


You should use the packages pdftools and pdftables.

If you are trying to read the text inside the PDF, use the pdf_text() function. Its argument is the path to the PDF, either on your computer or on the web. For example:

    library(pdftools)
    tt <- pdf_text("C:/Users/Smith/Documents/my_file.pdf")

It would help if you were more specific and gave us a reproducible example.
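The empty string in the question's output suggests the scan has no text layer, in which case pdf_text() has nothing to extract. Here is a minimal sketch of one way to handle that, assuming pdftools is installed together with the tesseract package that its pdf_ocr_text() function relies on. The helper names is_blank_page() and read_pdf_text() are made up for this example and are not part of pdftools.

```r
# Sketch: read a PDF's text layer and fall back to OCR when it is empty.
# Assumes pdftools (and tesseract, for the OCR step) is installed.

# TRUE for each page whose extracted text is empty or whitespace only.
is_blank_page <- function(pages) {
  !nzchar(trimws(pages))
}

# Hypothetical helper, not a pdftools function.
read_pdf_text <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  pages <- pdftools::pdf_text(path)  # one string per page
  if (all(is_blank_page(pages))) {
    # No text layer, which is typical of a scan: run OCR instead.
    pages <- pdftools::pdf_ocr_text(path)
  }
  pages
}

# Path taken from the question; only runs when the file is present.
pdf_path <- "C:/Users/myname/Documents/renewalscan.pdf"
if (file.exists(pdf_path)) {
  text <- read_pdf_text(pdf_path)
  cat(text[1])
}
```

The file.exists() check also addresses the directory question: with a full path like the one above, the working directory should not matter, and the error message prints it in case a relative path was used. OCR quality depends on the scan, so the result may need cleaning.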




Answer 2:


To use the PDFTables R package, you need to run the following command, replacing insert_API_key with your own API key:

    convert_pdf('test/index.pdf', output_file = NULL, format = "xlsx-single",
                message = TRUE, api_key = "insert_API_key")
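To avoid pasting the key into a script, it could be read from an environment variable instead. This is only a sketch: the variable name PDFTABLES_API_KEY and the helpers get_api_key() and convert_scan() are assumptions made for this example, and only convert_pdf() itself comes from the pdftables package.

```r
# Sketch: call PDFTables only when an API key is available.
# PDFTABLES_API_KEY is an assumed variable name chosen for this example.

# Hypothetical helper: fetch the key or stop with a clear message.
get_api_key <- function(var = "PDFTABLES_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Set ", var, " to your PDFTables API key first")
  }
  key
}

# Hypothetical wrapper around pdftables::convert_pdf().
convert_scan <- function(input, output, format = "csv") {
  pdftables::convert_pdf(input, output_file = output, format = format,
                         message = TRUE, api_key = get_api_key())
}

# Path taken from the question; only runs when the file is present.
pdf_path <- "C:/Users/myname/Documents/renewalscan.pdf"
if (file.exists(pdf_path)) {
  convert_scan(pdf_path, "my.csv")
}
```

The call in the question passes no api_key, so checking that a key is set is a reasonable first step, although it may not be the only cause of the HTTP 400 error.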



Answer 3:


If you are looking to get tabular data, you might try tabulizer. Here is a full code tutorial: https://www.business-science.io/code-tools/2019/09/23/tabulizer-pdf-scraping.html

Basically, you can use this code from the tutorial:

    library(tabulizer)
    extract_tables(
        file   = "2019-09-23-tabulizer/endangered_species.pdf",
        method = "decide",
        output = "data.frame")
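With output = "data.frame", extract_tables() returns a list with one data frame per detected table, so a natural next step is to save each one. A rough sketch, assuming tabulizer (which needs Java via rJava) is installed; save_tables() is a hypothetical helper written for this example, not a tabulizer function.

```r
# Sketch: write every extracted table to its own CSV file.

# Hypothetical helper: takes a list of data frames, returns the paths written.
save_tables <- function(tables, prefix = "table") {
  paths <- sprintf("%s_%02d.csv", prefix, seq_along(tables))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

# Path taken from the tutorial; only runs when the file is present.
pdf_file <- "2019-09-23-tabulizer/endangered_species.pdf"
if (file.exists(pdf_file)) {
  tables <- tabulizer::extract_tables(
    file   = pdf_file,
    method = "decide",
    output = "data.frame")
  save_tables(tables, prefix = "endangered_species")
}
```

Each file is named with the prefix and a two-digit table index, for example endangered_species_01.csv.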


Source: https://stackoverflow.com/questions/50749759/how-to-scrape-a-downloaded-pdf-file-with-r
