iTextSharp library does not extract text from my file

清歌不尽 2020-12-22 08:26

The iTextSharp library (version 5.5.5) does not extract text from my file, although I can copy and paste text from the PDF into Notepad. I uploaded the file to this link.

The source code

2 answers
  • 2020-12-22 09:11

    The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.

    Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.

    Identity-H can only be used with CIDFonts, which may carry any Registry, Ordering, and Supplement (ROS) values, and it is these ROS values that convey the actual encoding information from which a mapping to Unicode can be derived. Your file does declare such ROS values.
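
    To illustrate why Identity-H alone is not enough, here is a minimal Java sketch of the two steps involved. It does not use iText; the class name IdentityHDemo, both method names, and the tiny CID-to-Unicode table are hypothetical and invented for illustration only (they are not real values from any Adobe character collection). The first step is fully defined by Identity-H, the second one depends on a table selected by the font's ROS values:

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class IdentityHDemo {

        // Step 1 (Identity-H): read big-endian 2-byte character codes.
        // The CID is simply the same number as the character code.
        static List<Integer> decodeIdentityH(byte[] shownBytes) {
            if (shownBytes.length % 2 != 0) {
                throw new IllegalArgumentException("Identity-H strings consist of 2-byte codes");
            }
            List<Integer> cids = new ArrayList<>();
            for (int i = 0; i < shownBytes.length; i += 2) {
                int code = ((shownBytes[i] & 0xFF) << 8) | (shownBytes[i + 1] & 0xFF);
                cids.add(code); // identity mapping: CID == character code
            }
            return cids;
        }

        // Step 2: CID -> Unicode. This needs an external table chosen by the
        // font's Registry/Ordering/Supplement; without it nothing can be resolved.
        static String toUnicode(List<Integer> cids, Map<Integer, Integer> cidToUnicode) {
            StringBuilder text = new StringBuilder();
            for (int cid : cids) {
                Integer codePoint = cidToUnicode.get(cid);
                // U+FFFD marks a CID that the table cannot resolve.
                text.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
            }
            return text.toString();
        }

        public static void main(String[] args) {
            byte[] shownBytes = {0x00, 0x01, 0x00, 0x02};
            List<Integer> cids = decodeIdentityH(shownBytes);

            // Toy table with made-up entries, standing in for a real ROS resource file.
            Map<Integer, Integer> toyTable = new HashMap<>();
            toyTable.put(1, 0x4E2D);
            toyTable.put(2, 0x6587);

            System.out.println(cids);
            System.out.println(toUnicode(cids, toyTable));        // table present
            System.out.println(toUnicode(cids, new HashMap<>())); // table missing
        }
    }
    ```

    With an empty table every CID falls back to the replacement character, which mirrors the situation in which the mapping resources are not available to the extractor.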

    To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/DLL but have to be added to the class path as a separate jar/DLL file.

    I only tested this using the Java version of iText as I am more proficient with it.

    iText 5.x/Java

    The Maven coordinates for the 5.x version of this jar artifact:

    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itext-asian</artifactId>
        <version>5.2.0</version>
    </dependency>
    

    (As nothing has changed in these resources in recent years, there have been no 5.x releases of this artifact since 5.2.0.)

    After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.
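
    If extraction still returns nothing, it can help to first confirm that the extra jar is really visible on the classpath. The following is a small sketch using only the standard class loader; the class name AsianResourceCheck is hypothetical, and the resource path is an assumption about how the itext-asian jar is laid out, so please verify it against the actual contents of your jar:

    ```java
    class AsianResourceCheck {

        // Assumed resource name inside the itext-asian jar; check your jar's contents.
        static final String ASSUMED_RESOURCE =
                "com/itextpdf/text/pdf/fonts/cmaps/cjk_registry.properties";

        // Returns true if the given resource can be found by the class loader.
        static boolean isOnClasspath(String resourcePath) {
            ClassLoader loader = AsianResourceCheck.class.getClassLoader();
            if (loader == null) {
                loader = ClassLoader.getSystemClassLoader();
            }
            return loader.getResource(resourcePath) != null;
        }

        public static void main(String[] args) {
            System.out.println("Asian font resources found: " + isOnClasspath(ASSUMED_RESOURCE));
        }
    }
    ```

    A result of false only means that this particular resource name was not found, so it is a hint rather than proof that the jar is missing.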

    iTextSharp 5.x/.Net

    There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)

    Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files. I don't know, though, which of them work with the current iTextSharp 5.5.12.

    As the OP found out, one additionally has to register the DLLs with iTextSharp (in contrast to iText/Java):

    Here is how to tell iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

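    // Static constructor of the text extraction class (named PdfDocument here).
    static PdfDocument()
    {
        // Register the Asian font resource DLLs once, before any extraction runs.
        iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
        iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
    }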
    
  • 2020-12-22 09:18

    I have an addition to the answer given by @mkl. Here is how to tell iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:

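    // Static constructor of the text extraction class (named PdfDocument here).
    static PdfDocument()
    {
        // Register the Asian font resource DLLs once, before any extraction runs.
        iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
        iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
    }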
    