Counting pages in a Word document

Posted by 戏子无情 on 2019-12-07 13:56:06

Question


I'm trying to count the pages in a Word document with Java.

This is my current code; I'm using the Apache POI libraries:

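import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

String path1 = "E:/iugkh";
File f = new File(path1);
File[] files = f.listFiles();
int pagesCount = 0;
for (int i = 0; i < files.length; i++) {
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(files[i]));
    HWPFDocument wdDoc = new HWPFDocument(fis);
    // Reads the page count stored in the document's summary metadata
    int pagesNo = wdDoc.getSummaryInformation().getPageCount();
    pagesCount += pagesNo;
    System.out.println(files[i].getName()+":\t"+pagesNo);
}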

The output is:

ten.doc:    1
twelve.doc: 1
nine.doc:   1
one.doc:    1
eight.doc:  1
4teen.doc:  1
5teen.doc:  1
six.doc:    1
seven.doc:  1

This is not what I expected: the first three documents are 4 pages long, and the others are between 1 and 5 pages long.

What am I missing?

Do I have to use another library to count the pages correctly?

Thanks in advance


Answer 1:


This may help you. It counts the paragraphs that contain a form feed (a character sometimes used to separate pages), but I'm not sure it will work for all documents (I suspect it won't).

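// Requires org.apache.poi.hwpf.extractor.WordExtractor.
// `document` is an already-opened HWPFDocument.
WordExtractor extractor = new WordExtractor(document);
String[] paragraphs = extractor.getParagraphText();

// Start at 1: a document without any form feed still has one page.
int pageCount = 1;
for (int i = 0; i < paragraphs.length; ++i) {
    // Adds at most one page per paragraph, even if it holds several form feeds.
    if (paragraphs[i].indexOf("\f") >= 0) {
        ++pageCount;
    }
}

System.out.println(pageCount);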
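
As a variation on the loop above, here is a minimal sketch that counts every form feed character rather than at most one per paragraph. It works on plain strings so it can be tried without POI; the class name `FormFeedPageCounter`, the method `countPages`, and the sample paragraphs are made up for illustration. It is still only a heuristic: it sees the form feeds that are present in the text, and page boundaries that come purely from text flowing onto the next page leave no such marker.

```java
class FormFeedPageCounter {

    /**
     * Estimates a page count from paragraph text by counting form feed
     * characters ('\f'), treating each one as a page break.
     */
    static int countPages(String[] paragraphs) {
        int pageCount = 1; // a document has at least one page
        for (String paragraph : paragraphs) {
            for (int i = 0; i < paragraph.length(); i++) {
                if (paragraph.charAt(i) == '\f') {
                    pageCount++;
                }
            }
        }
        return pageCount;
    }

    public static void main(String[] args) {
        // Hypothetical text standing in for WordExtractor.getParagraphText()
        String[] paragraphs = { "First page\f", "Second page", "still second\fthird" };
        System.out.println(countPages(paragraphs)); // prints 3
    }
}
```

With POI on the classpath, the array returned by `extractor.getParagraphText()` could be passed to `countPages` directly.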



Answer 2:


Unfortunately, this is a bug in some versions of Word (apparently pre-2010 versions, possibly only Word 9.0, a.k.a. Word 2000), or at least in some versions of the COM previewer that is used to count the pages. The Apache developers declined to implement a workaround for it: https://issues.apache.org/jira/browse/TIKA-1523

When you open the file in Word, it of course shows the real pages and recalculates the count, although initially it too shows "1". The metadata saved in the file, however, is simply "1" or possibly nothing (see below). POI does not "reflow" the layout to calculate that information.

This is why the metadata is only updated by the word processing program when it opens and edits the file. If you instruct Word 2010 to open the file "read only" (which it does because the file was downloaded from the internet), it shows "" in the page column. See the 2nd screenshot. So this is clearly a bug in the file, not an issue in Tika or POI.

I also found there that Microsoft has confirmed the bug (for Word 9.0/2000): http://support.microsoft.com/kb/212653/en-us

If opening and re-saving the documents with a newer version of Word is not possible, another workaround would be to convert the documents to PDF (or even XPS) and count the pages of the result.
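
One possible way to do the PDF conversion from Java is to call LibreOffice in headless mode. This is only a sketch under assumptions not made in the answer above: that LibreOffice is installed, that its `soffice` executable is on the PATH, and that its pagination is close enough to Word's, which is not guaranteed. The class and method names are hypothetical. Counting the pages of the resulting PDF would need a separate PDF library and is not shown.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class DocToPdfConverter {

    /** Builds the LibreOffice command line for a headless conversion to PDF. */
    static List<String> buildCommand(Path doc, Path outDir) {
        return Arrays.asList(
                "soffice",                      // assumed to be on the PATH
                "--headless",                   // run without a user interface
                "--convert-to", "pdf",          // target format
                "--outdir", outDir.toString(),  // where the PDF is written
                doc.toString());
    }

    /** Runs the conversion and returns the exit code of the process. */
    static int convert(Path doc, Path outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(doc, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java DocToPdfConverter <input.doc> <output-dir>
            System.out.println("exit code: " + convert(Paths.get(args[0]), Paths.get(args[1])));
        } else {
            // Without arguments, only show the command for hypothetical paths.
            System.out.println(buildCommand(Paths.get("ten.doc"), Paths.get("out")));
        }
    }
}
```

Keeping the command construction in its own method makes it easy to inspect or adjust, for example to point at a full path to `soffice`, before anything is executed.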



Source: https://stackoverflow.com/questions/16442347/counting-pages-in-a-word-document
