How to scrape PDFs using Python; specific content only

Submitted by 爱⌒轻易说出口 on 2021-02-19 08:24:08

Question


I am trying to get data from PDFs available on the site

https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en

For example, If I look at November 2019 report

https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf

I need the data on page 12 for corn; I have to create separate files for ending stocks, exports, etc. I am new to Python and I am not sure how to scrape the content separately. If I can figure it out for one month, I can create a loop, but I am confused about how to proceed for one file.

Can someone help me out here? TIA.


Answer 1:


Here is a little example using PyPDF2, requests and BeautifulSoup. Please check the comments in the code; this handles the first block of releases on the page. If you need more, you need to change the value of the url variable.

# You need to install:
# pip install PyPDF2         -> read and parse your PDF content
# pip install requests       -> download the listing page and the PDFs
# pip install beautifulsoup4 -> parse the HTML and find all hrefs ending in ".pdf"
from PyPDF2 import PdfFileReader
import requests
import io
from bs4 import BeautifulSoup

url=requests.get('https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items')
soup = BeautifulSoup(url.content,"lxml")

for a in soup.find_all('a', href=True):
    mystr = a['href']
    if mystr.endswith('.pdf'):
        print("url ending in .pdf:", a['href'])
        urlpdf = a['href']
        response = requests.get(urlpdf)
        with io.BytesIO(response.content) as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
            txt = f"""
            Author: {information.author}
            Creator: {information.creator}
            Producer: {information.producer}
            Subject: {information.subject}
            Title: {information.title}
            Number of pages: {number_of_pages}
            """
            # Here is the metadata of your PDF
            print(txt)
            # numpage is the page index (zero-based)
            numpage = 20
            page = pdf.getPage(numpage)
            page_content = page.extractText()
            # print the content of that page
            print(page_content)
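
A usage note on top of the example above (this part is my assumption, not from the original answer): PyPDF2 page indices are zero-based, so the 12th page of a report is pdf.getPage(11). A minimal sketch that downloads one report and writes just that page's text to its own file could look like this; the output file name is only a placeholder:

# Sketch (assumption, not part of the original answer): extract one page of one
# report and save its raw text to a separate file for later parsing.
import io
import requests
from PyPDF2 import PdfFileReader

report_url = "https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/dz011445t/mg74r196p/latest.pdf"
response = requests.get(report_url)

with io.BytesIO(response.content) as f:
    pdf = PdfFileReader(f)
    page = pdf.getPage(11)        # zero-based index, so 11 is the 12th page
    text = page.extractText()

# splitting out ending stocks, exports, etc. would be a separate parsing
# step on this raw text
with open("corn_page_12.txt", "w", encoding="utf-8") as out:
    out.write(text)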



Answer 2:


I would recommend Beautiful Soup if you need to scrape data from a website, but it looks like you are going to need OCR to extract the data from the PDF. There is a library called pytesseract. Look into it and its tutorials and you should be set.
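
For what it's worth, a minimal OCR sketch along those lines (my assumption, not spelled out in the original answer) would render the PDF pages to images with pdf2image and then run pytesseract on them; it needs the poppler and Tesseract binaries installed, and latest.pdf is just a placeholder file name:

# Sketch (assumption): OCR one page of a downloaded report.
# Requires: pip install pdf2image pytesseract, plus poppler and Tesseract installed.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("latest.pdf", dpi=300)       # one PIL image per page
page12_text = pytesseract.image_to_string(pages[11])   # zero-based: the 12th page
print(page12_text)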




Answer 3:


Try pdfreader. You can extract the tables as PDF markdown containing decoded text strings and parse them as plain text.


from pdfreader import SimplePDFViewer

# open the report and point the viewer at page 12
fd = open("latest.pdf", "rb")
viewer = SimplePDFViewer(fd)
viewer.navigate(12)
viewer.render()

# decoded page content as PDF "markdown" (text plus PDF operators)
markdown = viewer.canvas.text_content

The markdown variable contains all the text, including PDF commands (positioning, display): all strings come in brackets followed by a Tj or TJ operator. For more on PDF text operators, see PDF 1.7 sec. 9.4 Text Objects.

You can parse it with regular expressions, for example.
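
As an illustrative sketch (my assumption about the parsing step, not code from the original answer), the strings drawn by the Tj operator sit in parentheses, so a simple regular expression can pull them out of the markdown variable from the snippet above:

import re

# Sketch (assumption): collect parenthesised strings that precede a Tj operator.
# Strings passed to the TJ (array) operator would need separate handling.
strings = re.findall(r"\((.*?)\)\s*Tj", markdown)
page_text = " ".join(strings)
print(page_text)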



Source: https://stackoverflow.com/questions/59130672/how-to-scrape-pdfs-using-python-specific-content-only
