How to load text of MS Word document in C# (.NET)?

Question

How do I load an MS Word document (.doc or .docx) into memory (a variable) without doing this?

wordApp.Documents.Open

I don't want to open MS Word; I just want the text inside.

You gave me an answer for DOCX, but what about DOC? I want a free, high-performance solution, not one that opens 12,000 instances of Word to process all of them. :( Aspose is a commercial product, and $900 is way too much for what I do.

Answer 1:

You can use wordconv.exe, which is part of the Office Compatibility Pack, to convert from doc to docx.

http://www.microsoft.com/downloads/details.aspx?familyid=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en

Just call the command like so: "C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme InputFile OutputFile

I'm not sure whether you need Word installed for it to run, but it does work. I use it locally as a Windows shell command to convert old Office files to the 2007 format whenever I want.
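The command above could be driven from C# along these lines. This is only a minimal sketch: the class and method names (WordConv, BuildArguments, Convert) are hypothetical, the executable path and the -oice -nme flags are copied from the command line above, and treating the process exit code as a success indicator is an assumption that has not been checked against wordconv.exe.

```csharp
using System;
using System.Diagnostics;

public static class WordConv
{
    // Path copied from the command above; adjust it for your installation.
    private const string ExePath =
        @"C:\Program Files\Microsoft Office\Office12\wordconv.exe";

    // Builds the argument string, quoting both paths so spaces survive.
    public static string BuildArguments(string inputFile, string outputFile)
    {
        return string.Format("-oice -nme \"{0}\" \"{1}\"", inputFile, outputFile);
    }

    // Runs the converter without a console window and waits for it to finish.
    // Returns the raw exit code; what each code means is not documented here.
    public static int Convert(string inputFile, string outputFile)
    {
        var info = new ProcessStartInfo
        {
            FileName = ExePath,
            Arguments = BuildArguments(inputFile, outputFile),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

For a large batch, calling WordConv.Convert(docPath, docxPath) once per file starts one short-lived converter process per document rather than an instance of Word.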

Answer 2:

For docx-formatted Word documents, I found this interesting article on The Code Project:

Using DocxToText to Extract Text from DOCX Files

In the article the author discusses stripping out just the words themselves.

For your doc (non-docx) Word documents, other than using the Office APIs and (in the background) spawning an instance of Word, you could try shelling out to one of the many doc-to-docx converters on the market and then applying the above process to both kinds of file.

Answer 3:

If you are dealing with docx, you can do this without any interop with Word. A .docx file is actually a ZIP archive that contains XML files, and you can read the XML directly. Please refer to the links below:

http://conceptdev.blogspot.com/2007/03/open-docx-using-c-to-extract-text-for.html

Office (2007) Open XML File Formats
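The ZIP-plus-XML approach can be sketched in C# roughly as follows. It assumes .NET 4.5 or later (on .NET Framework, a reference to System.IO.Compression.FileSystem is needed for ZipFile), and the class name DocxText is hypothetical. It reads only the main body part, word/document.xml, and joins the text runs of each paragraph, so headers, footers, footnotes, tabs and line breaks are ignored, and text boxes nested inside a paragraph may appear twice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of the main document part, one line per paragraph.
    public static string Extract(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found in " + path);

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream, LoadOptions.PreserveWhitespace);

                // Each w:p is a paragraph; its visible text lives in w:t elements.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

A call such as string text = DocxText.Extract("input.docx"); then gives the plain text without starting Word. Old binary .doc files are not ZIP archives, so they would still need to be converted first.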

Answer 4:

I recently did some research on this topic. It turns out that to be able to manipulate Word files programmatically without opening Word itself, you need some very expensive tools.

There's an article over at CodeProject on manipulating Word that you might find useful. The author built a C# COM wrapper for dealing with calls to Word. It looks like it actually pops open the Word application, though.

This post over at the Neowin forums looks promising too. It includes quite a few P/Invoke calls for the purpose of text extraction.

Maybe if you could find a way to keep the window hidden it would be acceptable.

Answer 5:

Aspose has a component to read, modify and write Word documents. Here is the product link: Aspose.Words for .NET and Java

Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®. Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market.

Answer 6:

With docxtemplater, you can easily get the full text of a Word document (works with docx only).

Here's the code (Node.js):

var DocxTemplater = require('docxtemplater');
var doc = new DocxTemplater().loadFromFile("input.docx");
var result = doc.getFullText();

This is just three lines of code and doesn't depend on any Word instance (it is all plain JS).

Answer 7:

I don't mean to be an antagonist, but why?

I've extracted data from Word documents on Linux servers using Word2X or AbiWord, and depending on the number and variety of documents, there will always be errors with the extraction. It gets worse the more bullets, page breaks, document sections and other "special" features there are.

I understand there are options now to automate OpenOffice to process documents, but my advice is, if you can, just use Word to process Word documents.

Source: https://stackoverflow.com/questions/215620/how-to-load-text-of-ms-word-document-in-c-sharp-net

Tags

.net

ms-word

docx

doc