Question
I'm trying to count the pages of Word documents with Java.
This is my current code; I'm using the Apache POI libraries:
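import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

String path1 = "E:/iugkh";
File f = new File(path1);
File[] files = f.listFiles();
int pagesCount = 0;
for (int i = 0; i < files.length; i++) {
    // Open each .doc file and read the page count stored in its summary metadata
    POIFSFileSystem fis = new POIFSFileSystem(new FileInputStream(files[i]));
    HWPFDocument wdDoc = new HWPFDocument(fis);
    int pagesNo = wdDoc.getSummaryInformation().getPageCount();
    pagesCount += pagesNo;
    System.out.println(files[i].getName() + ":\t" + pagesNo);
}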
The output is:
ten.doc: 1
twelve.doc: 1
nine.doc: 1
one.doc: 1
eight.doc: 1
4teen.doc: 1
5teen.doc: 1
six.doc: 1
seven.doc: 1
This is not what I expected: the first three documents are each 4 pages long, and the others are between 1 and 5 pages long.
What am I missing?
Do I have to use another library to count the pages correctly?
Thanks in advance.
Answer 1:
This may help you. It counts the form feeds in the text (sometimes used to separate pages), but I'm not sure it will work for all documents (I suspect it won't).
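import org.apache.poi.hwpf.extractor.WordExtractor;

// "document" is an already opened HWPFDocument (wdDoc in the question's code)
WordExtractor extractor = new WordExtractor(document);
String[] paragraphs = extractor.getParagraphText();
int pageCount = 1;
for (int i = 0; i < paragraphs.length; ++i) {
    // Each paragraph that contains a form feed is taken as the end of a page
    if (paragraphs[i].indexOf("\f") >= 0) {
        ++pageCount;
    }
}
System.out.println(pageCount);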
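As a rough sketch of the same idea on plain strings: the loop above adds one page per paragraph that contains a form feed, so a paragraph holding two breaks still counts once. The variant below counts every form-feed character instead. The class and method names (FormFeedPageCounter, estimatePages) are made up for illustration and are not part of Apache POI; the input is assumed to be the array returned by getParagraphText(). Either way, only explicit breaks are seen. Text that simply flows onto the next page leaves no form feed, so the result is a lower bound rather than a real page count.

```java
// Hypothetical helper for illustration; not part of Apache POI.
final class FormFeedPageCounter {

    // Estimates the page count as 1 plus the number of form-feed characters
    // found in the given paragraph texts.
    static int estimatePages(String[] paragraphs) {
        int pages = 1; // even an empty document has one page
        for (String paragraph : paragraphs) {
            for (int i = 0; i < paragraph.length(); i++) {
                if (paragraph.charAt(i) == '\f') {
                    pages++;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Sample data standing in for the output of getParagraphText()
        String[] sample = { "Intro\r", "End of page one\f", "Page two\f\f", "Last page\r" };
        System.out.println(estimatePages(sample)); // three form feeds, so this prints 4
    }
}
```

With POI this would be called as FormFeedPageCounter.estimatePages(extractor.getParagraphText()).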
Answer 2:
Unfortunately, this is a bug in some versions of Word (apparently pre-2010 versions, possibly only Word 9.0, a.k.a. Word 2000), or at least in some versions of the COM previewer that is used to count the pages. The Apache developers declined to implement a workaround for it: https://issues.apache.org/jira/browse/TIKA-1523
In fact, when you open the file in Word, it of course shows the real pages and also recalculates the count, but initially it shows "1" as well. The metadata as saved in the file is simply "1", or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.
This is why the metadata is only updated by the word processing program when the file is opened and edited. If you instruct Word 2010 to open the file "read only" (which it does because it's downloaded from the internet), it shows "" in the page column. See 2nd screenshot. So this is clearly a bug in the file, not an issue in TIKA or POI.
I also found in there that the bug (for Word 9.0/2000) was confirmed by Microsoft: http://support.microsoft.com/kb/212653/en-us
If opening and re-saving the documents with a newer version of Word is not possible, another workaround would be to convert the documents to PDF (or even XPS) and count the pages of the result.
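A minimal sketch of the conversion step, assuming LibreOffice is installed and its soffice executable is on the PATH. The answer above does not name a converter, so this tool choice is my assumption, and the class and method names (DocToPdfConverter, buildCommand, convert) are hypothetical. Note that the PDF reflects LibreOffice's layout, which may not match Word's page breaks exactly.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; assumes a LibreOffice installation.
final class DocToPdfConverter {

    // Builds the headless conversion command for one document.
    // Assumption: "soffice" can be found on the PATH.
    static List<String> buildCommand(File doc, File outDir) {
        return Arrays.asList("soffice", "--headless", "--convert-to", "pdf",
                "--outdir", outDir.getPath(), doc.getPath());
    }

    // Runs the conversion and returns the converter's exit code.
    static int convert(File doc, File outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(doc, outDir))
                .inheritIO() // forward the converter's output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call convert(...) to actually run it.
        System.out.println(buildCommand(new File("ten.doc"), new File("pdf-out")));
    }
}
```

The page count of the resulting PDF can then be read with a PDF library, for example PDDocument.getNumberOfPages() in Apache PDFBox.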
Source: https://stackoverflow.com/questions/16442347/counting-pages-in-a-word-document