How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  星月不相逢
    2020-11-22 14:24

    I recommend using pymupdf or pdfminer.six.

    The following packages are not maintained:

    • PyPDF2, PyPDF3, PyPDF4
    • pdfminer (without .six)

    How to read plain text with pymupdf

    There are different options which will give different results, but the most basic one is:

    import fitz  # this is pymupdf
    
    with fitz.open("my.pdf") as doc:
        text = ""
        for page in doc:
            text += page.get_text()  # get_text() is the current name; older PyMuPDF releases used getText()
    
    print(text)
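
    The same loop can be wrapped in a small reusable helper. This is only a sketch, assuming a recent PyMuPDF release in which the page method is named get_text(); the function names join_page_text and extract_pdf_text are made up for illustration and are not part of PyMuPDF.

```python
def join_page_text(pages, option="text"):
    """Concatenate the text of every page object in `pages`.

    Each page only needs a get_text(option) method that returns a string,
    which is what PyMuPDF pages provide for the "text" option.
    """
    # str.join avoids rebuilding the string on every iteration
    return "".join(page.get_text(option) for page in pages)


def extract_pdf_text(path, option="text"):
    """Open the PDF at `path` with PyMuPDF and return its text as one string."""
    import fitz  # this is pymupdf; imported here so join_page_text works without it

    with fitz.open(path) as doc:
        # iterating over the document yields its pages in order
        return join_page_text(doc, option)
```

    With PyMuPDF installed, calling print(extract_pdf_text("my.pdf")) should behave like the loop above. Only string-returning options fit this helper, because the results are joined into a single string.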
    
