Finding on which page a search string is located in a PDF document using Python

旧巷少年郎 2020-12-15 12:36

Which Python packages can I use to find out on which page a specific “search string” is located?

I looked into several Python PDF packages but couldn't figure …

3 Answers
  • 2020-12-15 13:09

    In addition to what @user1043144 mentioned: to use this with Python 3.x, import PyPDF2 instead of pyPdf:

    import PyPDF2
    

    Python 3 has no built-in file(), so open the PDF with open() instead:

    PyPDF2.PdfFileReader(open(xFile, 'rb'))
    
  • 2020-12-15 13:09

    I was able to successfully get the output using the code below.

    Code:

    import PyPDF2
    import re
    
    # Open the PDF file
    reader = PyPDF2.PdfFileReader(r"C:\TEST.pdf")
    
    # Get the number of pages
    num_pages = reader.getNumPages()
    
    # Text to search for; re.search treats it as a regular expression,
    # so wrap it in re.escape() if it should match literally
    search_string = "Enter_the_text_to_Search_here"
    
    # Extract the text of each page and search it
    for i in range(num_pages):
        page = reader.getPage(i)
        text = page.extractText()
        if re.search(search_string, text):
            # i is the zero-based page index
            print("Pattern Found on Page: " + str(i))
    

    Sample Output:

    Pattern Found on Page: 7
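
One thing to watch with the loop above: re.search treats the search string as a regular expression, so characters such as "." or "+" carry special meaning. Here is a minimal sketch of the difference, using only the standard library and made-up page text (no PDF involved), with re.escape forcing a literal match:

```python
import re

# Made-up page text, for illustration only
page_text = "Total due: 1250 USD (see section 1.5)"

# As a regular expression, "." matches any character, so "1.5" also matches "125"
regex_hit = re.search("1.5", page_text)

# re.escape() quotes the special characters so the string is matched literally
literal_hit = re.search(re.escape("1.5"), page_text)

print(regex_hit.group(0))    # 125
print(literal_hit.group(0))  # 1.5
```

If the search string comes from a user and is meant as plain text, escaping it first is probably the safer default.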
    
  • 2020-12-15 13:14

    I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

    (1) a function to locate the string

    def fnPDF_FindText(xFile, xString):
        # xFile : the PDF file in which to look
        # xString : the string to look for (treated as a regular expression)
        # Returns the zero-based index of the first matching page, or -1
        import pyPdf, re
        PageFound = -1
        pdfDoc = pyPdf.PdfFileReader(file(xFile, "rb"))
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            content1 = content.encode('ascii', 'ignore')
            # Match case-insensitively instead of lowercasing only the page text
            ResSearch = re.search(xString, content1, re.IGNORECASE)
            if ResSearch is not None:
                PageFound = i
                break
        return PageFound
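
The function above stops at the first hit and returns a zero-based index. Below is a rough sketch of a variant that reports every matching page with 1-based page numbers. It rests on several assumptions: it uses the newer pypdf package (the successor to PyPDF2, where the legacy names were replaced by PdfReader, reader.pages and extract_text()), the names find_pages and pdf_page_texts are made up for illustration, and the match is literal and case-insensitive by choice. The search logic is kept separate from the PDF reading so it can be tried on plain strings:

```python
import re


def find_pages(page_texts, needle, ignore_case=True):
    """Return the 1-based numbers of all pages whose text contains needle.

    page_texts is a list with one string per page; needle is matched
    literally (via re.escape), not as a regular expression.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(needle), flags)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if pattern.search(text or "")  # extraction can yield empty text
    ]


def pdf_page_texts(path):
    """Extract the text of every page (assumes the pypdf package is installed)."""
    from pypdf import PdfReader  # imported lazily so find_pages works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration on made-up page text, so no PDF file is needed
sample_pages = ["Introduction", "Search String appears here", "", "another search string"]
print(find_pages(sample_pages, "search string"))  # [2, 4]
```

With a real file this would be called as find_pages(pdf_page_texts(xFile), xString). Extraction quality varies from PDF to PDF, so a phrase that is split across lines or pages may still be missed.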
    

    (2) a function to extract the pages of interest

    def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
        # Copies pages xPageStart .. xPageEnd - 1 (zero-based) into a new PDF
        from pyPdf import PdfFileReader, PdfFileWriter
        output = PdfFileWriter()
        pdfOne = PdfFileReader(file(xFileNameOriginal, "rb"))
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        # Write the output file once, after all pages have been added
        outputStream = file(xFileNameOutput, "wb")
        output.write(outputStream)
        outputStream.close()
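
Both functions above work with zero-based page indexes, and range(xPageStart, xPageEnd) leaves out xPageEnd itself, which is easy to get wrong when starting from the page numbers shown in a viewer. Here is a small sketch of a helper that converts a 1-based, inclusive page range to that convention, together with an extraction function written against the newer pypdf package. The names to_index_range and extract_pages are hypothetical, and the use of pypdf is an assumption, not part of the original answer:

```python
def to_index_range(first_page, last_page, num_pages):
    """Convert a 1-based inclusive page range to (start, end) for range()."""
    if not 1 <= first_page <= last_page <= num_pages:
        raise ValueError("page range outside the document")
    # Zero-based start; the end stays as-is because range() excludes it
    return first_page - 1, last_page


def extract_pages(src_path, dst_path, first_page, last_page):
    """Copy pages first_page..last_page (1-based, inclusive) to a new PDF.

    Assumes the pypdf package is installed.
    """
    from pypdf import PdfReader, PdfWriter  # imported lazily

    reader = PdfReader(src_path)
    start, end = to_index_range(first_page, last_page, len(reader.pages))
    writer = PdfWriter()
    for i in range(start, end):
        writer.add_page(reader.pages[i])
    # Write once, after all pages have been added
    with open(dst_path, "wb") as out:
        writer.write(out)


print(to_index_range(3, 5, 10))  # (2, 5)
```

So pages 3 to 5 as seen in a viewer correspond to indexes 2, 3 and 4.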
    

    I hope this will be helpful to somebody else.
