How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  青春惊慌失措
    2020-11-22 14:22

    If you run PyPDF2 under Anaconda on Windows, it may fail on some PDFs with a non-standard structure or Unicode characters. If you need to open and read many PDF files, I recommend the following code, which uses the tika package instead. It stores the text of every PDF file in the pdfs folder (relative to the working directory) in the list pdf_text_list.

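    import glob
    import os

    from tika import parser  # third-party package: pip install tika


    def read_pdf(filename):
        # from_file returns a dict; the extracted text is under 'content'
        return parser.from_file(filename)


    # os.path.join builds the pattern portably (the original was Windows-only)
    all_files = glob.glob(os.path.join("pdfs", "*.pdf"))
    pdf_text_list = []
    for file in all_files:
        parsed = read_pdf(file)
        # 'content' may be None if no text could be extracted from the file
        pdf_text_list.append(parsed['content'])

    print(pdf_text_list)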
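For a larger batch, it can help to keep going when one file fails and to remember which text came from which file. The following is only a minimal sketch of that idea, not part of the answer above: the helper name `collect_pdf_text` and its `parse_file` parameter are hypothetical, and it assumes the parser returns a dict with a 'content' key, as tika's `parser.from_file` does.

```python
from pathlib import Path


def collect_pdf_text(folder, parse_file):
    """Return {filename: text} for every .pdf directly inside `folder`.

    `parse_file` (hypothetical parameter) is any callable that takes a path
    string and returns a dict with a 'content' key.
    """
    texts = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        try:
            parsed = parse_file(str(path))
        except Exception as exc:
            # Report the failure and move on to the next file.
            print(f"Skipping {path.name}: {exc}")
            continue
        # 'content' may be missing or None; store an empty string in that case.
        texts[path.name] = (parsed.get("content") or "").strip()
    return texts


# Possible usage with tika (assumed to be installed, as in the answer above):
#   from tika import parser
#   texts = collect_pdf_text("pdfs", parser.from_file)
```

Passing the parser in as an argument keeps the helper independent of tika, so another extraction library could be swapped in as long as it is wrapped to return a dict of the same shape.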
    
