How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 answers
  •  盖世英雄少女心
    2020-11-22 14:34

    I found a solution here: PDFLayoutTextStripper

    It's good because it can keep the layout of the original PDF.

    It's written in Java but I have added a Gateway to support Python.

    Sample code:

    from py4j.java_gateway import JavaGateway
    
    gw = JavaGateway()
    result = gw.entry_point.strip('samples/bus.pdf')
    
    # result is a dict of {
    #   'success': 'true' or 'false',
    #   'payload': pdf file content if 'success' is 'true'
    #   'error': error message if 'success' is 'false'
    # }
    
    print(result['payload'])
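
The result described in the comments above carries either a payload or an error, so it may be worth checking the 'success' field before printing. Here is a minimal sketch of that check; `extract_payload` is a hypothetical helper name, not part of the gateway, and it assumes 'success' arrives as the string 'true' or 'false' as the comments say.

```python
def extract_payload(result):
    """Return the extracted text, or raise if the gateway reported a failure."""
    # Assumption: 'success' is the string 'true' or 'false', as described
    # in the comments of the sample code; str() also tolerates a real bool.
    if str(result['success']).lower() == 'true':
        return result['payload']
    raise RuntimeError('PDF extraction failed: %s' % result['error'])


# Hand-built stand-in for a gateway result, so the sketch runs without Java.
fake_result = {'success': 'true', 'payload': 'page text'}
print(extract_payload(fake_result))
```

With the gateway running, the same helper should accept the value returned by `gw.entry_point.strip(...)`, provided that value supports key lookup like a dict.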
    

    Sample output from PDFLayoutTextStripper:

    You can see more details here: Stripper with Python
