How to extract images from Word documents using JavaScript?

Question

I am trying to extract images from Word documents using the ActiveXObject in JavaScript (IE only).

I was unable to find any API reference for the Word object, only a few hints from around the Internet:

var filename = 'path/to/word/doc.docx'
var word = new ActiveXObject('Word.Application')
var doc = word.Documents.Open(filename)
// doc.Content is a Range object; its plain text is in the Text property
var docText = doc.Content.Text

How would I access images in the Word doc using something like doc.Content?

Also, if anyone has a definitive source (preferably from Microsoft) for the API that'd be extremely helpful.

Answer 1:

After a few weeks of research, I found that the easiest way to extract the images is the SaveAs method of the Word document object. When the file is saved as an HTML document, Word creates a folder next to it containing the images.

From there, you can use XMLHttpRequest to load the HTML file and create new img tags that the browser can display (I'm using IE 9 because ActiveXObject only works in Internet Explorer).

Let's begin with the SaveAs portion:

// Define the path to the file
var filepath = 'path/to/the/word/doc.docx'
// Start a new Word application through ActiveX
var word = new ActiveXObject('Word.Application')
// Open the document
var doc = word.Documents.Open(filepath)
// Save the DOCX as an HTML file (the 8 specifies you want to save it as an HTML document)
doc.SaveAs(filepath + '.htm', 8)

Now there should be a folder in the same directory, named after the saved file with a _files suffix (for example doc.docx_files), containing the image files.
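The snippet above leaves the document open. Below is a minimal sketch of the same step wrapped in a function that also closes the document afterwards. The function name saveDocAsHtml and the two constant names are hypothetical, and the Word application object is passed in as a parameter so the logic can be tried without Word installed. The values 8 (HTML format) and 0 (close without saving changes) are assumed to match Word's WdSaveFormat and WdSaveOptions enumerations.

```javascript
// Assumed enumeration values from the Word object model
var WD_FORMAT_HTML = 8;          // WdSaveFormat.wdFormatHTML
var WD_DO_NOT_SAVE_CHANGES = 0;  // WdSaveOptions.wdDoNotSaveChanges

// Hypothetical helper: saves the document at `filepath` as HTML and
// returns the path of the generated .htm file.
// `word` is a Word.Application object, e.g. new ActiveXObject('Word.Application') in IE.
function saveDocAsHtml(word, filepath) {
    var htmlPath = filepath + '.htm';
    var doc = word.Documents.Open(filepath);
    try {
        doc.SaveAs(htmlPath, WD_FORMAT_HTML);
    } finally {
        // Close the document even if SaveAs threw, without saving further changes
        doc.Close(WD_DO_NOT_SAVE_CHANGES);
    }
    return htmlPath;
}

// Usage in IE (commented out because ActiveXObject exists only there):
// var word = new ActiveXObject('Word.Application');
// try { saveDocAsHtml(word, 'path/to/the/word/doc.docx'); } finally { word.Quit(); }
```

When converting many documents, reusing one Word instance and calling word.Quit() once at the end should avoid leaving stray WINWORD processes behind.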

Note: in the HTML that Word generates, images are represented by <v:imagedata> tags nested inside a <v:shape> tag; for example:

<v:shape style="width: 241.5pt; height: 71.25pt;">
     <v:imagedata src="path/to/the/word/doc.docx_files/image001.png">
         ...
     </v:imagedata>
</v:shape>

I've removed the extraneous attributes and tags that Word saves.
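If you only need the image paths, one alternative to walking the DOM is to pull the src values straight out of the HTML text. The sketch below does that with a regular expression; the function name extractImageSources is hypothetical, and it assumes the src attribute is quoted as in the example above. A regular expression is not a general HTML parser, so treat this as a shortcut for Word's fairly regular output rather than a robust solution.

```javascript
// Hypothetical helper: returns the src value of every <v:imagedata> tag
// found in a string of Word-generated HTML.
function extractImageSources(html) {
    // Matches <v:imagedata ... src="..."> or src='...' (case-insensitive)
    var pattern = /<v:imagedata\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
    var sources = [];
    var match;
    while ((match = pattern.exec(html)) !== null) {
        // Group 1 holds a double-quoted value, group 2 a single-quoted one
        sources.push(match[1] !== undefined ? match[1] : match[2]);
    }
    return sources;
}
```

Calling it on the v:shape example above should yield a one-element array holding the image001.png path.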

To access the HTML using JavaScript, use an XMLHttpRequest object.

var xmlhttp = new XMLHttpRequest()
var html_text = ""

Because I am accessing hundreds of Word docs, I've found it is best to define the XMLHttpRequest's onreadystatechange callback before sending the request.

// Define the onreadystatechange callback function
xmlhttp.onreadystatechange = function() {
    // Check to make sure the response has fully loaded
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
        // Grab the response text (assigns the html_text variable declared above)
        html_text = xmlhttp.responseText
        // Load the HTML into the innerHTML of a DIV to add the HTML to the DOM
        document.getElementById('doc_html').innerHTML = html_text.replace("<html>", "").replace("</html>", "")
        // Collect all HTML elements with the "v:imagedata" tag
        var images = document.getElementById('doc_html').getElementsByTagName("v:imagedata")
        // Loop through each image (j is declared with var so it does not leak as a global)
        for (var j = 0; j < images.length; j++) {
            // Grab the source attribute to get the image name
            var src = images[j].getAttribute('src')
            // Check to make sure the image has a 'src' attribute
            if (src != undefined) {
                ...

I've had many issues loading the correct src attribute because of the way IE escapes its HTML attributes when they are loaded into the innerHTML of the doc_html div, so in the example below I use a pseudo-path and src.split('/')[1] to grab the image name (this method won't work if the src contains more than one forward slash!):

                ...
                images[j].setAttribute('src', '/path/to/the/folder/containing/the/images/'+src.split('/')[1])
                ...
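To lift the one-slash limitation of src.split('/')[1], the file name can instead be taken as everything after the last path separator. This is a minimal sketch; the function name imageNameFromSrc is hypothetical, and treating backslashes as separators is an assumption about the paths you may encounter.

```javascript
// Hypothetical helper: returns the file name portion of an image path,
// regardless of how many directory levels precede it.
function imageNameFromSrc(src) {
    // Normalize backslashes so Windows-style paths are handled too
    var normalized = src.replace(/\\/g, '/');
    // lastIndexOf returns -1 when there is no slash, so the whole string is kept
    return normalized.substring(normalized.lastIndexOf('/') + 1);
}
```

In the loop above, imageNameFromSrc(src) could stand in for src.split('/')[1].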

Here we add a new img tag to the HTML div through the image's grandparent: the image's parent is the v:shape element, and that element's parent happens to be a p element. We append the new img tag to the p element's innerHTML, taking the src attribute from the image and the style information from the v:shape element:

                ...
                images[j].parentElement.parentElement.innerHTML+="<img src='"+images[j].getAttribute('src')+"' style='"+images[j].parentElement.getAttribute('style')+"'>"

            }
        }       
    }
}
// Read the HTML document using XMLHttpRequest (GET, since we are only fetching a file)
xmlhttp.open("GET", filepath + '.htm', false)
xmlhttp.send()

Although it is a bit specific to my setup, the method above successfully added img tags to the HTML at the positions the images had in the original document.
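The img markup in the loop above is built by plain string concatenation, which breaks if a path contains a quote and writes style='null' when the v:shape has no style attribute. Here is a small sketch of a safer way to build the same tag; both function names (escapeAttr and buildImgTag) are hypothetical.

```javascript
// Hypothetical helper: escapes characters that are unsafe inside a
// double-quoted HTML attribute value.
function escapeAttr(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Hypothetical helper: builds an <img> tag from a src and an optional style.
function buildImgTag(src, style) {
    var tag = '<img src="' + escapeAttr(src) + '"';
    // Skip the style attribute when the v:shape element had none
    if (style) {
        tag += ' style="' + escapeAttr(style) + '"';
    }
    return tag + '>';
}
```

The result can be appended to the grandparent's innerHTML in the same way as the concatenated string in the loop.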

Source: https://stackoverflow.com/questions/15212895/how-to-extract-images-from-word-documents-using-javascript

Tags

javascript

activexobject