How to extract the URL in hyperlinks from a docx file using Python

清酒与你 2020-12-18 11:21

I've been trying to find out how to get URLs from a docx file using Python, but failed to find anything. I've tried python-docx and python-docx2txt, but python-docx only

5 answers
  • 2020-12-18 11:58

    I'm late to this party, but if you want something that pulls all the links out of .docx files and makes a spreadsheet of them (or returns a list of them), I have a script that might do that for you. It includes both the URL and the linked text, and you can feed it a whole folder if you want.

    https://github.com/Colin-Fredericks/hx-py/blob/master/XML_utilities/GetWordLinks.py

    It uses BeautifulSoup and UnicodeCSV, both of which you can also grab from that same repo. It runs in Python 3, and instructions are at the top of the file. It handles non-ASCII characters. Only tested on Mac and Ubuntu so far. Excel does not reliably import Unicode CSVs, though Google Drive does. Offer void() where prohibited.
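For readers who only need the list of links, here is a minimal standard-library sketch of the same idea. It is not the linked script: the function name `get_word_links` is hypothetical, and it assumes the links are stored as ordinary `w:hyperlink` elements in the main document body (links in headers, footers, or field codes are not covered).

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def get_word_links(docx_source):
    """Return (linked_text, url) pairs found in the main document body.

    docx_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as zf:
        rels_root = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(zf.read("word/document.xml"))

    # Map relationship ids (such as "rId5") to their hyperlink targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        url = targets.get(hyperlink.get(f"{{{R_NS}}}id"))
        if url is None:
            continue  # e.g. an internal bookmark link, which has no URL
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append((text, url))
    return links
```

Calling `get_word_links("test.docx")` should then give one tuple per link, which can be written out with the `csv` module if a spreadsheet is wanted.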

  • 2020-12-18 12:03
    def iter_hyperlink_rels(rels):
        for rel in rels:
            if rels[rel].reltype == RT.HYPERLINK:
                yield rels[rel]
    

    This would remove the error.

  • 2020-12-18 12:07

    You can use WPS to save the document as an .html file, then process that file.
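If the exported file is HTML, the links can be read with the standard library alone. The sketch below is only an illustration under that assumption: `LinkCollector` and `links_from_html` are hypothetical names, and the exact markup WPS produces has not been checked here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def links_from_html(html_text):
    """Return the (text, href) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

To use it, read the exported file into a string (the right encoding depends on how the file was saved) and pass that string to `links_from_html`.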

  • 2020-12-18 12:12

    I solved it using the following code to print the hyperlink content from docx

    from docx import Document
    from docx.opc.constants import RELATIONSHIP_TYPE as RT
    
    document = Document('test.docx')
    rels = document.part.rels
    
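    def iter_hyperlink_rels(rels):
        # Iterating rels yields relationship ids; rels[rel] is the relationship.
        for rel in rels:
            if rels[rel].reltype == RT.HYPERLINK:
                yield rels[rel]._target
    
    # The function is a generator, so loop over it to print each URL.
    for url in iter_hyperlink_rels(rels):
        print(url)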
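A variation on the same idea, as a sketch: python-docx relationships also expose the public attributes `is_external` and `target_ref`, so the private `_target` can be avoided. The helper name `external_hyperlink_urls` is hypothetical, and the hyperlink type is spelled out as a string (the value of `RT.HYPERLINK`) so the function itself needs no import.

```python
# Same value as docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK.
HYPERLINK = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def external_hyperlink_urls(rels):
    """Map relationship id -> URL for external hyperlink relationships.

    rels is expected to be dict-like, as document.part.rels is.
    """
    return {
        r_id: rel.target_ref
        for r_id, rel in rels.items()
        if rel.reltype == HYPERLINK and rel.is_external
    }
```

With python-docx installed, `external_hyperlink_urls(Document('test.docx').part.rels)` should return the links of the main document part. Headers and footers are separate parts with their own relationships, so links there would need the same call on those parts.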
    
  • I am a beginner in Python and had an assignment to use Python to change each hyperlink in a .docx document. Kiran's code gave me enough hints that, after some guessing and trial and error, I got it working. Here is my code, shared for other beginners.

    # Python script to change the URL hyperlinks in a docx file.
    # See: https://stackoverflow.com/questions/40475757/how-to-extract-the-url-in-hyperlinks-from-a-docx-file-using-python
    
    from docx import Document
    from docx.opc.constants import RELATIONSHIP_TYPE as RT
    
    print(" This program changes the hyperlinks detected in a Word .docx file \n")
    
    docx_file = input(" Please input the docx filename (without .docx): ")
    
    document = Document(docx_file + ".docx")
    
    # Relationships of the main document part, keyed by relationship id.
    rels = document.part.rels
    
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            print("\n Original link id -", rel, "with detected URL: ", rels[rel]._target)
            new_url = input(" Please input new URL: ")
            rels[rel]._target = new_url
    
    out_file = docx_file + "-out.docx"
    
    document.save(out_file)
    
    print("\n File saved to: ", out_file)
    

    Thank you, Lapyiu Ho
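For documents with many links, typing each new URL by hand gets tedious. A non-interactive variant might look like the sketch below; `retarget_hyperlinks` and `url_map` are hypothetical names, and like the code above it relies on the private `_target` attribute, which could change between python-docx versions.

```python
# Same value as docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK.
HYPERLINK = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def retarget_hyperlinks(rels, url_map):
    """Replace hyperlink targets found in url_map; return how many changed.

    rels is dict-like (as document.part.rels is); url_map maps old URL -> new URL.
    """
    changed = 0
    for rel in rels.values():
        if rel.reltype != HYPERLINK:
            continue
        new_url = url_map.get(rel._target)
        if new_url is not None:
            rel._target = new_url
            changed += 1
    return changed
```

After calling `retarget_hyperlinks(document.part.rels, {...})`, save the document as before with `document.save(out_file)`. Only the link target changes; the visible link text in the document is left as it was.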
