How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  Happy的楠姐
    2020-11-22 14:20

    If you want to extract text from a table, I've found tabula easy to use, accurate, and fast.

    To get a pandas DataFrame:

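    import tabula
    
    # Note: tabula-py 2.0 and later return a list of DataFrames (one per table)
    dfs = tabula.read_pdf('your.pdf')
    df = dfs[0]  # first table found
    
    df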
    

    By default, it ignores page content outside of the table. So far, I've only tested on a single-page, single-table file, but there are kwargs to accommodate multiple pages and/or multiple tables.
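
The kwargs mentioned above can be sketched roughly as follows. This is a minimal example, assuming the `pages` and `multiple_tables` parameters of tabula-py's `read_pdf`; the helper name `read_all_tables` is made up for illustration and is not part of tabula-py. Since tabula-py needs a Java runtime, the import is deferred until the function is actually called.

```python
def read_all_tables(pdf_path, pages="all"):
    """Return a list of DataFrames, one per table that tabula detects.

    `read_all_tables` is a hypothetical helper name, not a tabula-py function.
    """
    # Imported lazily: tabula-py depends on a Java runtime, so the
    # dependency is only touched when the helper is used.
    import tabula

    # pages="all" scans every page (read_pdf looks at page 1 only by default);
    # multiple_tables=True asks for a separate DataFrame for each table.
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)
```

Calling `read_all_tables('your.pdf')` should give a list you can index or loop over; passing something like `pages=[1, 2]` restricts the scan to those pages.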

    Install via:

    pip install tabula-py
    # or
    conda install -c conda-forge tabula-py 
    

    For plain text extraction, see: https://stackoverflow.com/a/63190886/9249533
