My question is:
How can I extract text from a PDF file that is laid out in columns, so that the result is separated by those columns?
Background: I wo
Building on @mkl's answer, I used PDFBox to extract the text column by column.
I found the boundary between the two columns by trial and error:
// Imports for PDFBox 2.x; `document` is an already loaded PDDocument.
import java.awt.Rectangle;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.text.PDFTextStripperByArea;

StringBuilder pdfText = new StringBuilder();
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
// One rectangle per column, given as (x, y, width, height); tune these for your layout.
Rectangle rectLeft = new Rectangle(10, 60, 320, 820);
Rectangle rectRight = new Rectangle(330, 60, 320, 820);
stripper.addRegion("leftColumn", rectLeft);
stripper.addRegion("rightColumn", rectRight);
PDPageTree allPages = document.getDocumentCatalog().getPages();
int pageCount = document.getNumberOfPages();
for (int i = 0; i < pageCount; i++) {
    PDPage page = allPages.get(i);
    stripper.extractRegions(page);
    // Append the left column before the right one to keep the reading order.
    String leftText = stripper.getTextForRegion("leftColumn");
    String rightText = stripper.getTextForRegion("rightColumn");
    pdfText.append(leftText);
    pdfText.append(rightText);
}
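
The rectangles above were tuned by hand. As a possible starting point before fine-tuning, the regions can be derived from the page size instead. The following is only a sketch under the assumption that the columns are equally wide and separated at the midpoint; the class `ColumnRegions`, the method `split`, and its parameters `margin` and `top` are hypothetical names introduced for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

public class ColumnRegions {

    /**
     * Splits a page into equal-width column rectangles (x, y, width, height).
     * Assumption: columns are equally wide; real layouts may need adjusting.
     *
     * @param pageWidth  page width, e.g. from page.getMediaBox().getWidth()
     * @param pageHeight page height, e.g. from page.getMediaBox().getHeight()
     * @param columns    number of columns to split into
     * @param margin     hypothetical left/right margin to skip
     * @param top        hypothetical vertical offset where the regions start
     */
    static Rectangle2D[] split(double pageWidth, double pageHeight,
                               int columns, double margin, double top) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double columnWidth = (pageWidth - 2 * margin) / columns;
        Rectangle2D[] regions = new Rectangle2D[columns];
        for (int i = 0; i < columns; i++) {
            // Each column starts where the previous one ends.
            regions[i] = new Rectangle2D.Double(
                    margin + i * columnWidth, top, columnWidth, pageHeight - top);
        }
        return regions;
    }

    public static void main(String[] args) {
        // 660 x 880 is an illustrative page size, not taken from a real document.
        for (Rectangle2D region : split(660, 880, 2, 10, 60)) {
            System.out.println(region);
        }
    }
}
```

Each returned rectangle could then be registered with `stripper.addRegion("column" + i, regions[i])` and read back with `getTextForRegion` in the same order. If the columns in your document are not equally wide, the computed boundary will still need manual adjustment.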