How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  不要未来只要你来
    2020-11-22 14:21

    I've tried many Python PDF converters, and I'd like to update this review. Tika is one of the best, but PyMuPDF, suggested by user @ehsaneha, is also a good option.

    I wrote some code to compare them at https://github.com/erfelipe/PDFtextExtraction and I hope it helps you.

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

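    from tika import parser

    # from_file returns a dict with "metadata" and "content" keys
    parsed = parser.from_file("///Users/Documents/Textos/Texto1.pdf")

    # "content" can be None when no text could be extracted
    text = parsed.get("content") or ""

    # Drop characters that cannot be encoded as UTF-8
    safe_text = text.encode("utf-8", errors="ignore").decode("utf-8")

    # Remove newlines and backslashes from the extracted text
    safe_text = safe_text.replace("\n", "").replace("\\", "")
    print("--- safe text ---")
    print(safe_text)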
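
Removing every newline can glue the last word of one line to the first word of the next. As a rough sketch of a gentler clean-up, the helper below keeps line breaks and only collapses extra whitespace. The name `clean_extracted_text` is hypothetical (it is not part of Tika), and the sample dict only imitates the "content" key of a Tika result so that the snippet runs without a Tika server.

```python
import re


def clean_extracted_text(parsed):
    """Tidy the text of a Tika-style result dict (hypothetical helper)."""
    # "content" may be missing or None when nothing could be extracted
    content = (parsed or {}).get("content") or ""
    # Drop characters that cannot be encoded as UTF-8
    content = content.encode("utf-8", errors="ignore").decode("utf-8")
    # Collapse runs of spaces and tabs, and trim each line
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in content.splitlines()]
    # Skip empty lines and keep one newline between the remaining ones
    return "\n".join(line for line in lines if line)


# Stand-in for the dict returned by parser.from_file (assumed shape)
sample = {"metadata": {}, "content": "\n\n  Hello   world \n\n\nSecond\tline\n"}
print(clean_extracted_text(sample))
```

With real input, passing the result of `parser.from_file(...)` instead of `sample` should work the same way, assuming the result carries its text under "content".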
    
