If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

遇见更好的自我 2020-12-04 10:23

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI t

2 Answers
  •  情书的邮戳
    2020-12-04 10:54

    To properly extract formatted text, a library or utility should:

    1. Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
    2. Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
    3. Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
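
To make the third requirement concrete, here is a minimal sketch in Python of one possible grouping heuristic. It is not how any particular PDF reader or library works. The `Glyph` record, the `y_tolerance` and `gap_ratio` thresholds, and all function names are hypothetical, and the sketch assumes the first two requirements have already produced correct glyph positions and advance widths in page units.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical record for one positioned character on a page."""
    char: str
    x: float      # left edge, in page units
    y: float      # baseline, in page units (larger y is higher on the page)
    width: float  # advance width, already scaled by font size and graphics state


def group_into_lines(glyphs: List[Glyph], y_tolerance: float = 2.0) -> List[List[Glyph]]:
    """Cluster glyphs whose baselines are within y_tolerance of a line's first glyph."""
    lines: List[List[Glyph]] = []
    # Visit glyphs top to bottom so each new baseline starts a new line.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within a line, reading order is left to right.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_text(line: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Join glyphs, inserting a space where the horizontal gap looks like a word break."""
    parts: List[str] = []
    prev = None
    for g in line:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            # Assumption: a gap wider than a fraction of the previous glyph is a space.
            if gap > gap_ratio * prev.width:
                parts.append(" ")
        parts.append(g.char)
        prev = g
    return "".join(parts)


def extract_text(glyphs: List[Glyph]) -> str:
    """Turn an unordered bag of positioned glyphs into lines of text."""
    return "\n".join(line_to_text(line) for line in group_into_lines(glyphs))
```

Note that the heuristic is only as good as its inputs: if a glyph's `width` or position is wrong because font metrics or the graphics state were mishandled, word breaks appear in the wrong places. Column detection would need a further pass over the horizontal gaps between lines, which this sketch leaves out.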

    I am not really an expert on the products you mentioned in your question, so the following conclusions should be taken with a grain of salt.

    Tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details at a deeper level, and they might not be as well tested in maintaining graphics state.

    Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. Fixing those shortcomings also helps it excel at text extraction.
