PDFBox adding white spaces within words

轮回少年 2020-12-10 04:03

When I try to extract text from my PDF files, it seems to insert white spaces within several words at random.

I am using pdfbox-app-1.6.0.jar (latest version) on fol

2 answers
  •  忘掉有多难
    2020-12-10 04:33

    The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) lets you adjust how readily it decides whether two strings belong to the same word.

    Increasing spacingTolerance will reduce the number of inserted spaces.

    /**
     * Set the space width-based tolerance value that is used
     * to estimate where spaces in text should be added.  Note that the
     * default value for this has been determined from trial and error.
     * Setting this value larger will reduce the number of spaces added. 
     * 
     * @param spacingToleranceValue tolerance / scaling factor to use
     */
    public void setSpacingTolerance(float spacingToleranceValue) {
        this.spacingTolerance = spacingToleranceValue;
    }
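
To see why a larger tolerance yields fewer spaces, here is a rough model of a width-based gap test. It is only a sketch and not PDFBox's actual code: the Glyph class, the join method, and the rule "insert a space when the gap exceeds spaceWidth * spacingTolerance" are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a width-based spacing heuristic. NOT PDFBox's implementation;
// every name here is hypothetical and exists only for this illustration.
class SpacingToleranceSketch {

    // A run of text with its horizontal extent on the line.
    static final class Glyph {
        final String text;
        final float startX;
        final float endX;

        Glyph(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Concatenates the runs, inserting a space whenever the gap to the
    // previous run is wider than spaceWidth * spacingTolerance (assumed rule).
    static String join(List<Glyph> glyphs, float spaceWidth, float spacingTolerance) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                float gap = g.startX - previous.endX;
                if (gap > spaceWidth * spacingTolerance) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("wo", 0f, 10f),
                new Glyph("rd", 12f, 22f),    // gap of 2: loose kerning inside a word
                new Glyph("gap", 27f, 42f));  // gap of 5: a real word break

        System.out.println(join(line, 4f, 0.3f)); // prints "wo rd gap"
        System.out.println(join(line, 4f, 1.0f)); // prints "word gap"
    }
}
```

With the real library, the equivalent step would be calling setSpacingTolerance on your PDFTextStripper instance before extracting the text, then raising the value step by step until the stray spaces disappear without real word breaks being lost.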
    
