Question
An exception "Coordinate outside allowed range" is thrown when I try to use LocationTextExtractionStrategy:
for (int pageNum = 1; pageNum <= document.getNumberOfPages(); pageNum++)
{
    PdfPage page = document.getPage(pageNum);
    sb.append(PdfTextExtractor.getTextFromPage(page, new LocationTextExtractionStrategy()));
}
More information about the exception:
java.lang.IllegalStateException: Coordinate outside allowed range
at com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBase.rangeTest(ClipperBase.java:76)
I have 2 similar PDFs generated by the same software. The exception is thrown for the first but not for the second.
PDF 1 (exception)
PDF 2 (ok)
What is throwing this exception in the first PDF? How do I solve this without using the SimpleTextExtractionStrategy?
Answer 1:
(According to your stack trace you are using an iText 7.* version. I updated your question tags accordingly and reproduced the issue with the current iText 7.1.2-SNAPSHOT.)
What is throwing this exception in the first PDF?
In short
Both your PDFs contain extreme y coordinates (beyond ISO 32000-1 implementation limits) for defining clip paths. Your PDF 1 is merely twice as extreme as PDF 2, and the iText clip path routines start to fail somewhere in between.
In detail
The page content stream of page 1 of PDF 1 essentially looks like this:
q
[...]
% modifyCTM
0.802969 0 0 -0.802969 0 842 cm
[...]
q
0 0 741 98417 re W n
[...]
Q
q
0 0 741 98417 re W n
[...]
Q
q
0 0 741 98417 re W n
[...]
Q
q
0 0 741 98417 re W n
[...]
Q
q
0 0 741 98417 re W n
[...]
Q
q
0 0 741 98417 re W n
[...]
Q
Q
Thus, even after accounting for the initial modification of the CTM, you define clip path rectangles six times, each with a height of 98417 * 0.802969 default user units, i.e. approximately 79026 default user units.
ISO 32000-1 Annex C.2, "Architectural limits", on the other hand indicates:
conforming readers should accommodate PDF files that obey the constraints.
[...]
- The minimum page size should be 3 by 3 units in default user space; the maximum should be 14,400 by 14,400 units.
Thus, your clip path rectangle is more than five times as high as the largest page a conforming reader is expected to support. Consequently, a conforming reader need not support your extreme clip paths.
PDF 2 is built similarly; the clip paths in question are merely 41879 * 0.802969 units high, i.e. about 33628 units, which is still more than twice as high as needs to be supported. For some reason iText appears to support this still.
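The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it looks solely at the y scale of the CTM and the rectangle heights quoted from the two content streams.

```java
/** Compares the clip rectangle heights from both PDFs with the Annex C.2 page size guideline. */
final class ClipHeightCheck {
    // Largest page dimension a conforming reader should accommodate (ISO 32000-1 Annex C.2).
    static final double MAX_PAGE_UNITS = 14_400;

    /** Scales a height given in content stream units by the absolute y scale of the CTM. */
    static double toDefaultUserUnits(double height, double yScale) {
        return height * Math.abs(yScale);
    }

    public static void main(String[] args) {
        double yScale = -0.802969;          // d component of "0.802969 0 0 -0.802969 0 842 cm"
        double[] heights = {98417, 41879};  // "re" heights in PDF 1 and PDF 2
        for (double h : heights) {
            double units = toDefaultUserUnits(h, yScale);
            System.out.printf("%.0f units, %.2f times the guideline%n",
                    units, units / MAX_PAGE_UNITS);
        }
    }
}
```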
How do I solve this without using the SimpleTextExtractionStrategy?
You can tweak iText 7 by changing the constant com.itextpdf.kernel.pdf.canvas.parser.clipper.ClipperBridge.floatMultiplier:
/**
* Since the clipper library uses integer coordinates, we should convert
* our floating point numbers into fixed point numbers by multiplying by
* this coefficient. Vary it to adjust the preciseness of the calculations.
*/
public static double floatMultiplier = Math.pow(10, 14);
You can try e.g. Math.pow(10, 10), which works for me with both your files.
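Why a smaller multiplier helps can be illustrated with a self-contained sketch. It rests on an assumption that the answer does not state: that the range test in ClipperBase rejects fixed-point values above 0x3FFFFFFFFFFFFFFF (about 4.6e18), as the upstream Clipper library does. The class and method names are hypothetical and the code does not call iText.

```java
/** Sketch of the fixed-point scaling and range check; the bound is an assumption. */
final class ClipperRangeSketch {
    // Assumed upper bound of the clipper range test (about 4.6e18); not taken from the answer.
    static final long HI_RANGE = 0x3FFFFFFFFFFFFFFFL;

    /** Returns true if the coordinate stays within the assumed bound after scaling. */
    static boolean fitsRange(double coordinate, double floatMultiplier) {
        return Math.abs(coordinate) * floatMultiplier <= (double) HI_RANGE;
    }

    /** Largest coordinate magnitude that passes the check for a given multiplier. */
    static double maxCoordinate(double floatMultiplier) {
        return HI_RANGE / floatMultiplier;
    }

    public static void main(String[] args) {
        double defaultMultiplier = Math.pow(10, 14);  // value quoted from ClipperBridge
        double reducedMultiplier = Math.pow(10, 10);  // value suggested in the answer
        double[] heights = {79026, 33628};            // approximate clip heights of PDF 1 and PDF 2
        for (double h : heights) {
            System.out.println(h + ": default=" + fitsRange(h, defaultMultiplier)
                    + ", reduced=" + fitsRange(h, reducedMultiplier));
        }
        System.out.println("max coordinate at 10^14: " + maxCoordinate(defaultMultiplier));
    }
}
```

Under that assumption the default multiplier admits coordinates up to roughly 46000 units, which lies between the two clip heights and would match the observed behaviour. In real code the change itself is a single assignment to the public static field, made once before the extraction loop. The price is fewer decimal digits of precision in the clip path calculations.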
That being said, ISO 32000-2 appears to have dropped this specific page size limit. There are merely more generic limits plus statements like "a particular PDF processor running on a particular device and in a particular operating environment will always have practical limits".
Thus, @iText should consider whether the current limits are such practical limits or should be relaxed.
Source: https://stackoverflow.com/questions/49781516/itext-coordinate-outside-allowed-range-exception-using-locationtextlocationstr