Extracting bold text from Resumes( .Docx,.Doc,PDF) using Python

♀尐吖头ヾ 提交于 2019-12-01 12:06:29

问题


I have thousands of resumes in any format like word with .doc, .docx and pdf.

I want to extract bold text from these documents using textract library in python. is there a way to extract using textract?


回答1:


An easy solution would be to use the python-docx package. install the package using ( !pip install python-docx )

You'll need to convert your pdf files to .docx . you can do that using any online pdf to docx converter or use python to do that.

the following lines of codes will extract all bold and italic contents of your resumes and save them in a dictionary called boltalic_Dict. you may retrieve either later on.

from docx import *

document = Document('path_to_your_files')
bolds=[]
italics=[]
for para in document.paragraphs:
    for run in para.runs:
        if run.italic :
            italics.append(run.text)
        if run.bold :
            bolds.append(run.text)

boltalic_Dict={'bold_phrases':bolds,
              'italic_phrases':italics}

I hope this helps.



来源:https://stackoverflow.com/questions/52125333/extracting-bold-text-from-resumes-docx-doc-pdf-using-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!