Finding which page a search string is located on in a PDF document using Python

旧巷少年郎

Which Python packages can I use to find out on which page a specific "search string" is located?

I looked into several Python PDF packages but couldn't figur

3 Answers

悲&欢浪女

2020-12-15 13:09
In addition to what @user1043144 mentioned, to use this with Python 3.x:

Use PyPDF2 instead of pyPdf:
```
import PyPDF2
```
Use open() instead of file(), which no longer exists in Python 3:
```
PyPDF2.PdfFileReader(open(xFile, 'rb'))
```
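
For readers on a recent PyPDF2 release, here is a minimal sketch of the same search under Python 3. It assumes PyPDF2 2.0 or later, where the reader class is `PdfReader`, pages are exposed as `reader.pages`, and text extraction is `extract_text()`. The function names `pages_matching` and `find_text_in_pdf` are hypothetical and not part of any library.

```python
import re


def pages_matching(page_texts, pattern):
    """Return the zero-based indexes of pages whose text matches `pattern`.

    `pattern` is treated as a regular expression and matched ignoring case.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    # extract_text() may yield an empty value for pages without text,
    # so fall back to an empty string before searching.
    return [i for i, text in enumerate(page_texts) if regex.search(text or "")]


def find_text_in_pdf(path, pattern):
    """Extract the text of every page in the PDF at `path` and search it."""
    # Imported lazily so pages_matching() can be used without PyPDF2.
    # Assumption: PyPDF2 >= 2.0 is installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as fh:
        reader = PdfReader(fh)
        texts = [page.extract_text() for page in reader.pages]
    return pages_matching(texts, pattern)
```

Unlike a version that stops at the first hit, this returns every matching page, so an empty list means the pattern was not found.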

清歌不尽

2020-12-15 13:09

I was able to successfully get the output using the code below.

Code:

```
import PyPDF2
import re

# Open the PDF file
reader = PyPDF2.PdfFileReader(r"C:\TEST.pdf")

# Get the number of pages
num_pages = reader.getNumPages()

# Text to search for (re.search treats it as a regular expression)
search_string = "Enter_the_text_to_Search_here"

# Extract the text of each page and search it
for i in range(num_pages):
    page = reader.getPage(i)
    text = page.extractText()
    if re.search(search_string, text):
        # i is a zero-based page index
        print("Pattern Found on Page: " + str(i))
```

Sample Output:

Pattern Found on Page: 7
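
Two things can trip up a search like the one above: `re.search` interprets the search string as a regular expression, and extracted page text may contain line breaks or runs of spaces in the middle of a phrase. The following is a rough sketch of a literal, whitespace-tolerant match that reports one-based page numbers. The helper names `normalize` and `literal_page_numbers` are made up for illustration, and the page texts are assumed to have been extracted already.

```python
import re


def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def literal_page_numbers(page_texts, needle):
    """Return one-based numbers of pages containing `needle` literally.

    The comparison ignores case and differences in whitespace. Because it
    is a plain substring test, characters such as + or ( are not treated
    as regular-expression syntax.
    """
    target = normalize(needle).lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if target in normalize(text).lower()
    ]


# Example with hand-written page texts standing in for extracted text.
pages = ["intro", "uses C++\n(fast)", "c++   (FAST) again"]
print(literal_page_numbers(pages, "C++ (fast)"))  # prints [2, 3]
```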

自闭症患者

2020-12-15 13:14

I finally figured out that pyPDF can help. I am posting it in case it can help somebody else.

(1) a function to locate the string

```
def fnPDF_FindText(xFile, xString):
    # xFile : the PDF file in which to look
    # xString : the string (regular expression) to look for
    # Returns the zero-based index of the first matching page, or -1
    import pyPdf, re
    PageFound = -1
    pdfDoc = pyPdf.PdfFileReader(open(xFile, "rb"))
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        # Drop non-ASCII characters and lowercase the page text
        content1 = content.encode('ascii', 'ignore').lower()
        # Ignore case so an uppercase xString still matches the lowercased text
        ResSearch = re.search(xString, content1, re.IGNORECASE)
        if ResSearch is not None:
            PageFound = i
            break
    return PageFound
```

(2) a function to extract the pages of interest

```
def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (zero-based) into a new file
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    pdfOne = PdfFileReader(open(xFileNameOriginal, "rb"))
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    # Write the output once, after all pages have been added
    outputStream = open(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()
```
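
The two functions combine naturally: the page index returned by the search becomes the start of the range to extract. Below is a possible sketch of that glue for Python 3. It assumes PyPDF2 2.0 or later (`PdfReader`, `PdfWriter`, `add_page`). The names `page_window` and `extract_pages` are hypothetical. As in `fnPDF_ExtractPages`, the end of the range is exclusive, and a negative index means "not found", matching the -1 returned by `fnPDF_FindText`.

```python
def page_window(found, num_pages, before=0, after=0):
    """Return a half-open (start, end) page range around `found`.

    `found` is a zero-based page index, or a negative value when the search
    failed, in which case None is returned. The range is clamped to the
    document so it never runs past the first or last page.
    """
    if found < 0:
        return None
    start = max(0, found - before)
    end = min(num_pages, found + after + 1)
    return start, end


def extract_pages(src_path, dst_path, start, end):
    """Copy pages start .. end - 1 (zero-based) of src_path into dst_path."""
    # Imported lazily so page_window() can be used without PyPDF2.
    # Assumption: PyPDF2 >= 2.0 is installed.
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    with open(src_path, "rb") as src:
        reader = PdfReader(src)
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        # Write while the source is still open, since page content is
        # read from the source stream.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

For example, `page_window(7, 10, before=1, after=1)` gives `(6, 9)`, which selects the matching page plus one page on either side.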

I hope this will be helpful to somebody else
