extract PDF text by columns

Question

My question is:

How can I extract text from a PDF file that is laid out in columns so that the result is separated by column?

Background: I work on a project about text analysis (especially of scientific texts). These texts are sometimes published in multi-column layouts in which each column has its own page number. To order the extracted text by these printed page numbers, it would be useful to extract the text column by column.

I use PDFBox and have tried or searched for several things:

I tried the getThreadBeads() method of the PDPage class -> result: a list of size 0
I tried grabbing the text with the getCharactersByArticle() method -> the text is not divided into columns
(I tried this with PDF files of published texts as well as with self-created .doc-based files; each has a multi-column layout.)

The thing is that PDFBox seems to divide the text into columns automatically: if I set setSortByPosition() of a PDFTextStripper to true, the characters of a page are output line by line across the full page width, without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

So I had a look at the PDFBox source code: the crucial method is the writePage() method of PDFTextStripper. This is evidently where spaces (which are not present in most PDFs) and line breaks are calculated. But I couldn't find how the stripper calculates the column breaks.

So the questions again:

How does PDFTextStripper calculate column breaks?
Are there methods in the PDFBox API to capture this, i.e. to extract the text by columns?
Is this possible with other PDF APIs?

Thanks in advance.

Answer 1:

if I set setSortByPosition() of a PDFTextStripper to true, the characters of a page are output line by line across the full page width, without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

[...] How does PDFTextStripper calculate column breaks?

It isn't.

By setting SortByPosition to false you tell PDFBox not to sort the text pieces from the page content stream but to accept them in the order in which they appear.

In your document the text pieces seem to be drawn in the reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.

Activating that option (setting SortByPosition to true) for your document returns the text without regard to the columns.
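
The difference between the two settings can be illustrated without PDFBox. The following is only a minimal sketch: the Piece class, the sample coordinates, and the assumption that y grows downward are all made up for illustration, and the sort is a simplification of what position sorting does, not PDFBox's actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderDemo {
    /** Hypothetical text piece: a position on the page plus its text. */
    static final class Piece {
        final double x;
        final double y;
        final String text;

        Piece(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins the pieces in the order given, i.e. the content stream order. */
    static String inStreamOrder(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            sb.append(p.text).append(' ');
        }
        return sb.toString().trim();
    }

    /** Sorts top-to-bottom, then left-to-right, before joining. */
    static String sortedByPosition(List<Piece> pieces) {
        List<Piece> copy = new ArrayList<>(pieces);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return inStreamOrder(copy);
    }

    /** Two columns with two lines each, drawn column by column. */
    static List<Piece> samplePage() {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece(50, 100, "L1"));
        pieces.add(new Piece(50, 120, "L2"));
        pieces.add(new Piece(330, 100, "R1"));
        pieces.add(new Piece(330, 120, "R2"));
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(samplePage()));    // L1 L2 R1 R2
        System.out.println(sortedByPosition(samplePage())); // L1 R1 L2 R2
    }
}
```

In stream order the columns stay together only because the sample page happens to draw them that way; sorting by position interleaves the lines of both columns.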

Are there methods in the PDFBox API to capture this, i.e. to extract the text by columns?

PDFBox does not analyze the page content to recognize columns. If you do that analysis yourself, though, it allows you to extract text column by column (via PDFTextStripperByArea) if you provide the column rectangles as regions.
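
One possible form of such an analysis is to look for the widest vertical strip that no text covers. The sketch below is a hypothetical heuristic, not something PDFBox provides: GutterFinder and findGutter are made-up names, and the input is assumed to be the left and right x coordinates of each text line, collected by whatever means you have.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GutterFinder {
    /**
     * Returns the x coordinate at the center of the widest horizontal gap
     * that no span covers, or -1 if the spans leave no gap between them.
     * Each span is a two-element array {left, right}.
     */
    static double findGutter(List<double[]> spans) {
        List<double[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingDouble((double[] s) -> s[0]));

        double bestWidth = 0;
        double bestCenter = -1;
        boolean first = true;
        double coveredUntil = 0;

        for (double[] s : sorted) {
            if (first) {
                coveredUntil = s[1];
                first = false;
                continue;
            }
            // A span starting beyond everything seen so far opens a gap.
            if (s[0] > coveredUntil) {
                double width = s[0] - coveredUntil;
                if (width > bestWidth) {
                    bestWidth = width;
                    bestCenter = coveredUntil + width / 2;
                }
            }
            coveredUntil = Math.max(coveredUntil, s[1]);
        }
        return bestCenter;
    }

    public static void main(String[] args) {
        List<double[]> spans = new ArrayList<>();
        spans.add(new double[] {50, 290});   // left column lines
        spans.add(new double[] {55, 280});
        spans.add(new double[] {320, 560});  // right column lines
        spans.add(new double[] {325, 550});
        System.out.println(findGutter(spans)); // 305.0
    }
}
```

A heading or figure caption that spans the full page width would hide the gap, so such lines would have to be filtered out first.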

Answer 2:

Building on @mkl's answer, I used PDFBox to extract the text by columns.

I found the boundary between the two columns by trial and error:

    import java.awt.Rectangle;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    // Assumes PDFBox 2.x and that `document` is an already loaded PDDocument.
    StringBuilder pdfText = new StringBuilder();
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);

    // Column rectangles (x, y, width, height), found by trial and error for this layout.
    Rectangle rectLeft = new Rectangle(10, 60, 320, 820);
    Rectangle rectRight = new Rectangle(330, 60, 320, 820);
    stripper.addRegion("leftColumn", rectLeft);
    stripper.addRegion("rightColumn", rectRight);

    // Extract both regions on every page, appending the left column before the right one.
    for (PDPage page : document.getPages()) {
        stripper.extractRegions(page);
        pdfText.append(stripper.getTextForRegion("leftColumn"));
        pdfText.append(stripper.getTextForRegion("rightColumn"));
    }
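
As an alternative starting point to trial and error, the region rectangles could be derived from the page size. This is a rough sketch under the assumption of equal-width columns and a uniform margin; ColumnRegions and split are hypothetical names, and the page dimensions are plain numbers here rather than values read from a PDF.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class ColumnRegions {
    /**
     * Splits the area inside the margin into equally wide column rectangles,
     * ordered left to right.
     */
    static List<Rectangle2D> split(double pageWidth, double pageHeight,
                                   int columns, double margin) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be at least 1");
        }
        double columnWidth = (pageWidth - 2 * margin) / columns;
        double columnHeight = pageHeight - 2 * margin;

        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            double x = margin + i * columnWidth;
            regions.add(new Rectangle2D.Double(x, margin, columnWidth, columnHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        for (Rectangle2D r : split(600, 800, 2, 10)) {
            System.out.println(r);
        }
    }
}
```

Each rectangle can then be registered with stripper.addRegion(...) under its own name. With PDFBox the page size would presumably come from page.getMediaBox(), and the computed rectangles may still need manual adjustment for layouts with unequal columns.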

Source: https://stackoverflow.com/questions/26233387/extract-pdf-text-by-columns

Tags

pdf

pdfbox