问题
I'm using PDFbox to extract the coordinates of words/strings in a PDF document, and have so far had success determining the position of individual characters. this is the code thus far, from the PDFbox doc:
package printtextlocations;
import java.io.*;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import java.io.IOException;
import java.util.List;
public class PrintTextLocations extends PDFTextStripper {
public PrintTextLocations() throws IOException {
super.setSortByPosition(true);
}
public static void main(String[] args) throws Exception {
PDDocument document = null;
try {
File input = new File("C:\\path\\to\\PDF.pdf");
document = PDDocument.load(input);
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
PrintTextLocations printer = new PrintTextLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
System.out.println("Processing page: " + i);
PDStream contents = page.getContents();
if (contents != null) {
printer.processStream(page, page.findResources(), page.getContents().getStream());
}
}
} finally {
if (document != null) {
document.close();
}
}
}
/**
* @param text The text to be processed
*/
@Override /* this is questionable, not sure if needed... */
protected void processTextPosition(TextPosition text) {
System.out.println("String[" + text.getXDirAdj() + ","
+ text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
+ text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width="
+ text.getWidthDirAdj() + "]" + text.getCharacter());
}
}
This produces a series of lines containing the position of each character, including spaces, that looks like this:
String[202.5604,41.880127 fs=1.0 xscale=13.98 height=9.68814 space=3.8864403 width=9.324661]P
Where 'P' is the character. I have not been able to find a function in PDFbox to find words, and I am not familiar enough with Java to be able to accurately concatenate these characters back into words to search through even though the spaces are also included. Has anyone else been in a similar situation, and if so how did you approach it? I really only need the coordinate of the first character in the word so that parts simplified, but as to how I'm going to match a string against that kind of output is beyond me.
回答1:
There is no function in PDFBox that allows you to extract words automatically. I'm currently working on extracting data to gather it into blocks and here is my process:
I extract all the characters of the document (called glyphs) and store them in a list.
I do an analysis of the coordinates of each glyph, looping over the list. If they overlap (if the top of the current glyph is contained between the top and bottom of the preceding/or the bottom of the current glyph is contained between the top and bottom of the preceding one), I add it to the same line.
At this point, I have extracted the different lines of the document (be careful, if your document is multi-column, the expression "lines" means all the glyphs that overlap vertically, ie the text of all the columns that have the same vertical coordinates).
Then, you can compare the left coordinate of the current glyph to the right coordinate of the preceding one to determine if they belong to the same word or not (the PDFTextStripper class provides a getSpacingTolerance() method that gives you, based on trials and errors, the value of a "normal" space. If the difference between the right and the left coordinates is lower than this value, both glyphs belong to the same word.
I applied this method to my work and it works good.
回答2:
Based on the original idea here is a version of the text search for PDFBox 2. The code itself is rough, but simple. It should get you started fairly quickly.
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.Set;
import lu.abac.pdfclient.data.PDFTextLocation;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
public class PrintTextLocator extends PDFTextStripper {
private final Set<PDFTextLocation> locations;
public PrintTextLocator(PDDocument document, Set<PDFTextLocation> locations) throws IOException {
super.setSortByPosition(true);
this.document = document;
this.locations = locations;
this.output = new Writer() {
@Override
public void write(char[] cbuf, int off, int len) throws IOException {
}
@Override
public void flush() throws IOException {
}
@Override
public void close() throws IOException {
}
};
}
public Set<PDFTextLocation> doSearch() throws IOException {
processPages(document.getDocumentCatalog().getPages());
return locations;
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
super.writeString(text);
String searchText = text.toLowerCase();
for (PDFTextLocation textLoc:locations) {
int start = searchText.indexOf(textLoc.getText().toLowerCase());
if (start!=-1) {
// found
TextPosition pos = textPositions.get(start);
textLoc.setFound(true);
textLoc.setPage(getCurrentPageNo());
textLoc.setX(pos.getXDirAdj());
textLoc.setY(pos.getYDirAdj());
}
}
}
}
回答3:
take a look on this, I think it's what you need.
https://jackson-brain.com/using-pdfbox-to-locate-text-coordinates-within-a-pdf-in-java/
Here is the code:
import java.io.File;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
public class PrintTextLocations extends PDFTextStripper {
public static StringBuilder tWord = new StringBuilder();
public static String seek;
public static String[] seekA;
public static List wordList = new ArrayList();
public static boolean is1stChar = true;
public static boolean lineMatch;
public static int pageNo = 1;
public static double lastYVal;
public PrintTextLocations()
throws IOException {
super.setSortByPosition(true);
}
public static void main(String[] args)
throws Exception {
PDDocument document = null;
seekA = args[1].split(",");
seek = args[1];
try {
File input = new File(args[0]);
document = PDDocument.load(input);
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
PrintTextLocations printer = new PrintTextLocations();
List allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
PDPage page = (PDPage) allPages.get(i);
PDStream contents = page.getContents();
if (contents != null) {
printer.processStream(page, page.findResources(), page.getContents().getStream());
}
pageNo += 1;
}
} finally {
if (document != null) {
System.out.println(wordList);
document.close();
}
}
}
@Override
protected void processTextPosition(TextPosition text) {
String tChar = text.getCharacter();
System.out.println("String[" + text.getXDirAdj() + ","
+ text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
+ text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width="
+ text.getWidthDirAdj() + "]" + text.getCharacter());
String REGEX = "[,.\\[\\](:;!?)/]";
char c = tChar.charAt(0);
lineMatch = matchCharLine(text);
if ((!tChar.matches(REGEX)) && (!Character.isWhitespace(c))) {
if ((!is1stChar) && (lineMatch == true)) {
appendChar(tChar);
} else if (is1stChar == true) {
setWordCoord(text, tChar);
}
} else {
endWord();
}
}
protected void appendChar(String tChar) {
tWord.append(tChar);
is1stChar = false;
}
protected void setWordCoord(TextPosition text, String tChar) {
tWord.append("(").append(pageNo).append(")[").append(roundVal(Float.valueOf(text.getXDirAdj()))).append(" : ").append(roundVal(Float.valueOf(text.getYDirAdj()))).append("] ").append(tChar);
is1stChar = false;
}
protected void endWord() {
String newWord = tWord.toString().replaceAll("[^\\x00-\\x7F]", "");
String sWord = newWord.substring(newWord.lastIndexOf(' ') + 1);
if (!"".equals(sWord)) {
if (Arrays.asList(seekA).contains(sWord)) {
wordList.add(newWord);
} else if ("SHOWMETHEMONEY".equals(seek)) {
wordList.add(newWord);
}
}
tWord.delete(0, tWord.length());
is1stChar = true;
}
protected boolean matchCharLine(TextPosition text) {
Double yVal = roundVal(Float.valueOf(text.getYDirAdj()));
if (yVal.doubleValue() == lastYVal) {
return true;
}
lastYVal = yVal.doubleValue();
endWord();
return false;
}
protected Double roundVal(Float yVal) {
DecimalFormat rounded = new DecimalFormat("0.0'0'");
Double yValDub = new Double(rounded.format(yVal));
return yValDub;
}
}
Dependencies:
PDFBox, FontBox, Apache Common Logging Interface.
You can run it by typing on command line:
javac PrintTextLocations.java
sudo java PrintTextLocations file.pdf WORD1,WORD2,....
the output is similar to:
[(1)[190.3 : 286.8] WORD1, (1)[283.3 : 286.8] WORD2, ...]
回答4:
I got this working using the IKVM conversion PDFBox.NET 1.8.9. in C# and .NET.
I finally figured out the character (glyph) coordinates are private to the .NET assembly, but can be accessed using System.Reflection
.
I posted a full example of getting the coordinates of WORDS and drawing them back on images of PDF's using SVG and HTML here: https://github.com/tsamop/PDF_Interpreter
For the example below you need PDFbox.NET: http://www.squarepdf.net/pdfbox-in-net, and include references to it in your project.
It took me quite a while to figure it out, so I really hope it saves someone else time!!
If you just need to know where to look for the characters & coordinates, a very abridged version would be:
using System;
using System.Reflection;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
// to test run pdfTest.RunTest(@"C:\temp\test_2.pdf");
class pdfTest
{
//simple example for getting character (gliph) coordinates out of a pdf doc.
// a more complete example is here: https://github.com/tsamop/PDF_Interpreter
public static void RunTest(string sFilename)
{
//probably a better way to get page count, but I cut this out of a bigger project.
PDDocument oDoc = PDDocument.load(sFilename);
object[] oPages = oDoc.getDocumentCatalog().getAllPages().toArray();
int iPageNo = 0; //1's based!!
foreach (object oPage in oPages)
{
iPageNo++;
//feed the stripper a page.
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.setStartPage(iPageNo);
tStripper.setEndPage(iPageNo);
tStripper.getText(oDoc);
//This gets the "charactersByArticle" private object in PDF Box.
FieldInfo charactersByArticleInfo = typeof(PDFTextStripper).GetField("charactersByArticle", BIndingFlags.NonPublic | BindingFlags.Instance);
object charactersByArticle = charactersByArticleInfo.GetValue(tStripper);
object[] aoArticles = (object[])charactersByArticle.GetField("elementData");
foreach (object oArticle in aoArticles)
{
if (oArticle != null)
{
//THE CHARACTERS within the article
object[] aoCharacters = (object[])oArticle.GetField("elementData");
foreach (object oChar in aoCharacters)
{
/*properties I caulght using reflection:
* endX, endY, font, fontSize, fontSizePt, maxTextHeight, pageHeight, pageWidth, rot, str textPos, unicodCP, widthOfSpace, widths, wordSpacing, x, y
*
*/
if (oChar != null)
{
//this is a really quick test.
// for a more complete solution that pulls the characters into words and displays the word positions on the page, try this: https://github.com/tsamop/PDF_Interpreter
//the Y's appear to be the bottom of the char?
double mfMaxTextHeight = Convert.ToDouble(oChar.GetField("maxTextHeight")); //I think this is the height of the character/word
char mcThisChar = oChar.GetField("str").ToString().ToCharArray()[0];
double mfX = Convert.ToDouble(oChar.GetField("x"));
double mfY = Convert.ToDouble(oChar.GetField("y")) - mfMaxTextHeight;
//CALCULATE THE OTHER SIDE OF THE GLIPH
double mfWidth0 = ((Single[])oChar.GetField("widths"))[0];
double mfXend = mfX + mfWidth0; // Convert.ToDouble(oChar.GetField("endX"));
//CALCULATE THE BOTTOM OF THE GLIPH.
double mfYend = mfY + mfMaxTextHeight; // Convert.ToDouble(oChar.GetField("endY"));
double mfPageHeight = Convert.ToDouble(oChar.GetField("pageHeight"));
double mfPageWidth = Convert.ToDouble(oChar.GetField("pageWidth"));
System.Diagnostics.Debug.Print(@"add some stuff to test {0}, {1}, {2}", mcThisChar, mfX, mfY);
}
}
}
}
}
}
}
using System.Reflection;
/// <summary>
/// To deal with the Java interface hiding necessary properties! ~mwr
/// </summary>
public static class GetField_Extension
{
public static object GetField(this object randomPDFboxObject, string sFieldName)
{
FieldInfo itemInfo = randomPDFboxObject.GetType().GetField(sFieldName, BindingFlags.NonPublic | BindingFlags.Instance);
return itemInfo.GetValue(randomPDFboxObject);
}
}
来源:https://stackoverflow.com/questions/11873801/using-pdfbox-to-determine-the-coordinates-of-words-in-a-document