How to scrape PDFs using Python; specific content only

Submitted by 爱⌒轻易说出口 on 2021-02-19 08:24:08

Question


I am trying to get data from PDFs available on the site

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

For example, If I look at November 2019 report

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

I need the data on page 12 for corn; I have to create separate files for ending stocks, exports, etc. I am new to Python and I am not sure how to scrape the content separately. If I can figure it out for one month, I can create a loop, but I am confused about how to proceed for one file.

Can someone help me out here? TIA.


Answer 1:


Here is a little example using PyPDF2, requests and BeautifulSoup. Please check the comments in the code; this handles the first block of releases on the page. If you need more, you need to change the value of the url variable.

# You need to install:
# pip install PyPDF2         -> read and parse your PDF content
# pip install requests       -> download the listing page and the PDFs
# pip install beautifulsoup4 -> parse the HTML and find all hrefs ending in ".pdf"
from PyPDF2 import PdfFileReader
import requests
import io
from bs4 import BeautifulSoup

url=requests.get('https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items')
soup = BeautifulSoup(url.content,"lxml")

for a in soup.find_all('a', href=True):
    mystr = a['href']
    if mystr.endswith('.pdf'):
        print("url ending in .pdf:", a['href'])
        urlpdf = a['href']
        response = requests.get(urlpdf)
        with io.BytesIO(response.content) as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
            txt = f"""
            Author: {information.author}
            Creator: {information.creator}
            Producer: {information.producer}
            Subject: {information.subject}
            Title: {information.title}
            Number of pages: {number_of_pages}
            """
            # Here is the metadata of your PDF
            print(txt)
            # numpage is the page index (zero-based)
            numpage = 20
            page = pdf.getPage(numpage)
            page_content = page.extractText()
            # print the content of that page
            print(page_content)
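
A usage note on top of the example above (this part is my assumption, not from the original answer): PyPDF2 page indices are zero-based, so the 12th page of a report is pdf.getPage(11). A minimal sketch that downloads one report and writes just that page's text to its own file could look like this; the output file name is only a placeholder:

# Sketch (assumption, not part of the original answer): extract one page of one
# report and save its raw text to a separate file for later parsing.
import io
import requests
from PyPDF2 import PdfFileReader

report_url = "https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf"
response = requests.get(report_url)

with io.BytesIO(response.content) as f:
    pdf = PdfFileReader(f)
    page = pdf.getPage(11)        # zero-based index, so 11 is the 12th page
    text = page.extractText()

# splitting out ending stocks, exports, etc. would be a separate parsing
# step on this raw text
with open("corn_page_12.txt", "w", encoding="utf-8") as out:
    out.write(text)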



Answer 2:


I would recommend Beautiful Soup if you need to scrape data from a website, but it looks like you are going to need OCR to extract the data from the PDF. There is a library called pytesseract. Look into it and its tutorials and you should be set.
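
For what it's worth, a minimal OCR sketch along those lines (my assumption, not spelled out in the original answer) would render the PDF pages to images with pdf2image and then run pytesseract on them; it needs the poppler and Tesseract binaries installed, and latest.pdf is just a placeholder file name:

# Sketch (assumption): OCR one page of a downloaded report.
# Requires: pip install pdf2image pytesseract, plus poppler and Tesseract installed.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("latest.pdf", dpi=300)       # one PIL image per page
page12_text = pytesseract.image_to_string(pages[11])   # zero-based: the 12th page
print(page12_text)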




Answer 3:


Try pdfreader. You can extract the tables as PDF markdown containing decoded text strings and parse them as plain text.


from pdfreader import SimplePDFViewer

# open the report and point the viewer at page 12
fd = open("latest.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.navigate(12)
viewer.render()

# decoded page content as PDF "markdown" (text plus PDF operators)
markdown = viewer.canvas.text_content

The markdown variable contains all the text, including PDF commands (positioning, display): all strings come in brackets followed by a Tj or TJ operator. For more on PDF text operators, see PDF 1.7 sec. 9.4 Text Objects.

You can parse it with regular expressions, for example.
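
As an illustrative sketch (my assumption about the parsing step, not code from the original answer), the strings drawn by the Tj operator sit in parentheses, so a simple regular expression can pull them out of the markdown variable from the snippet above:

import re

# Sketch (assumption): collect parenthesised strings that precede a Tj operator.
# Strings passed to the TJ (array) operator would need separate handling.
strings = re.findall(r"\((.*?)\)\s*Tj", markdown)
page_text = " ".join(strings)
print(page_text)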



Source: https://stackoverflow.com/questions/59130672/how-to-scrape-pdfs-using-python-specific-content-only
