Programmatically reading a Microsoft Word document

Posted by 折月煮酒 on 2019-12-12 10:51:42

Question


I have my students submit their Microsoft Word assignments to a ColdFusion 10 server. I'd like to write an error checker that looks for common mistakes, such as a missing page number in the header, or a missing school name or student name on the title page. I enforce a lot of APA rules. For example, the phrase "Running head:" must be in the header section of page 1 but not in the rest of the paper. I assign a point value to each rule.

Ideally, this error checker would run when students submit the assignment and report the results to them immediately. That might require parsing the document as XML, using something like:

parser.parseFromString(str, "text/xml");

Alternatively, a program that I run myself to check for errors, written with Microsoft Access or Visual Studio for example, could help automate my grading. I'd rather not do that, though, because I'd have to have Visual Studio on the server, and I don't think that's going to be feasible.

The last option would be to download all the papers off the server and run a program locally, which is one step better than grading everything manually.
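
The rules-with-point-values scheme described above can be sketched independently of how the text is pulled out of the Word file. The following is a minimal Python sketch, not a definitive implementation: the `Rule` class, the dictionary keys (`first_header`, `default_header`, `title_page`, `school`), and the point values are all hypothetical, and the input strings are assumed to come from some separate extraction step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rule:
    description: str
    points: int
    check: Callable[[Dict[str, str]], bool]  # returns True when the rule is satisfied


# Hypothetical rules and point values. The input is a dict of plain strings
# that an extraction step (not shown here) is assumed to have produced.
RULES: List[Rule] = [
    Rule('"Running head:" appears in the page 1 header', 5,
         lambda doc: "Running head:" in doc["first_header"]),
    Rule('"Running head:" is absent from later headers', 5,
         lambda doc: "Running head:" not in doc["default_header"]),
    Rule("School name appears on the title page", 3,
         lambda doc: doc["school"].lower() in doc["title_page"].lower()),
]


def grade(doc: Dict[str, str]) -> Tuple[int, List[str]]:
    """Return the points lost and the descriptions of the failed rules."""
    failed = [rule for rule in RULES if not rule.check(doc)]
    return sum(rule.points for rule in failed), [rule.description for rule in failed]
```

Calling `grade(...)` with the four strings for one paper returns the total deduction plus one message per broken rule, which could be shown to the student at submission time or kept for grading.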


Answer 1:


I did this a few years back using VBA; refer to this article. Here is an excerpt that walks through each paragraph of a document:

Public Sub ParseLines()
    Dim singleLine As Paragraph
    Dim lineText As String

    For Each singleLine In ActiveDocument.Paragraphs
        lineText = singleLine.Range.Text

        '// parse the text here...

    Next singleLine
End Sub
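
The VBA loop above needs Word itself to be running. As a rough alternative for a server without Word, the same paragraph walk can be sketched in Python with only the standard library, since a .docx file is a ZIP package whose parts are XML. This is a sketch under assumptions: `body_paragraphs` is a hypothetical helper name, the main part is assumed to sit at the conventional path `word/document.xml`, and paragraphs nested inside text boxes or tables are not treated specially.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts of a .docx package.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def body_paragraphs(docx_bytes):
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Assumption: the main part lives at the conventional path.
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # Word splits a paragraph's text across several w:t run elements,
        # so the pieces are joined back together here.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs
```

Each returned string plays the role of `lineText` in the VBA excerpt, so the per-paragraph checks can be applied to the list in order.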



Answer 2:


I know you already found an answer, but I thought I'd throw in Apache POI as a way to extract the data from the Word document. You can get the page headers like so:

// Open the .docx file and load it into a POI XWPFDocument
fis = createObject("java","java.io.FileInputStream").init(ExpandPath('./mydoc.docx'));
document = createObject("java","org.apache.poi.xwpf.usermodel.XWPFDocument").init(fis);
fis.close();

// getFirstPageHeader() and getDefaultHeader() return null when the document has no such header
policy = document.getHeaderFooterPolicy();
firstHeader = policy.getFirstPageHeader().getText();
defaultHeader = policy.getDefaultHeader().getText();

I know this covers only the header portion of your request. According to the documentation, there is also a way to get the even- and odd-page headers.

APACHE POI
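
For comparison, the first-page and default headers that the POI policy object exposes can also be located directly in the .docx package. The sketch below is in Python with only the standard library and rests on several assumptions: `headers_by_type` is a hypothetical helper name, the document has a single section (a later section's headers would overwrite earlier ones), the main part sits at `word/document.xml`, and relationship targets are relative paths. It reads each `w:headerReference` in the body, resolves its relationship id through `word/_rels/document.xml.rels`, and collects the text of the referenced header part.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def headers_by_type(docx_bytes):
    """Map a header type ("first", "default", "even") to that header's text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Relationship ids (rId...) point at the actual header part names.
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        targets = {rel.get("Id"): rel.get("Target")
                   for rel in rels.iter(REL + "Relationship")}
        document = ET.fromstring(package.read("word/document.xml"))
        headers = {}
        for ref in document.iter(W + "headerReference"):
            # Assumption: targets are relative to the word/ folder.
            target = targets[ref.get(R + "id")]
            part = posixpath.normpath(posixpath.join("word", target))
            root = ET.fromstring(package.read(part))
            headers[ref.get(W + "type")] = "".join(
                t.text or "" for t in root.iter(W + "t"))
    return headers
```

With the result in hand, the "Running head:" rule from the question reduces to two substring tests, one against `headers.get("first", "")` and one against `headers.get("default", "")`.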




Answer 3:


Try out:

http://docxextractor.riaforge.org/

It extracts all of the clear text and some of the formatting.

Disclaimer: I wrote it.



Source: https://stackoverflow.com/questions/14655315/programmatically-reading-a-microsoft-word-document
