iTextSharp library does not extract text from my file

The iTextSharp library (version 5.5.5) does not extract text from my file, although I can copy and paste the text from the PDF into Notepad. I uploaded the file to this link.

The source code

2 Answers

被撕碎了的回忆

2020-12-22 09:11
The PDF declarations of the Asian fonts in your sample PDF do not contain a ToUnicode map to allow mapping from character codes to Unicode.

Furthermore, their encoding is Identity-H which is kind of a pseudo-encoding as it merely maps 2-byte character codes ranging from 0 to 65,535 to the same 2-byte CID value, so this still doesn't define a fixed encoding usable for text extraction.

Identity-H may actually only be used with CIDFonts using any Registry, Ordering, and Supplement values, and these ROS values convey the actual encoding information from which a mapping to Unicode can be derived. This is the case in your file.
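
To illustrate the point, here is a minimal Java sketch of the two steps involved: reading an Identity-H string as 2-byte codes (which are the CIDs themselves), and then looking the CIDs up in a CID-to-Unicode table. The class name `IdentityHDemo`, both method names, and the table contents are hypothetical. The table values are invented for the demo and are not taken from any real Registry/Ordering/Supplement collection.

```java
import java.util.Map;

/** Toy illustration of why Identity-H alone is not enough for text extraction. */
final class IdentityHDemo {

    /** Identity-H: each 2-byte big-endian code is used unchanged as the CID. */
    static int[] toCids(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("Identity-H strings consist of 2-byte codes");
        }
        int[] cids = new int[raw.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((raw[2 * i] & 0xFF) << 8) | (raw[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    /** Maps CIDs to text via a CID-to-Unicode table; unknown CIDs become U+FFFD. */
    static String toText(int[] cids, Map<Integer, Integer> cidToUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.appendCodePoint(cidToUnicode.getOrDefault(cid, 0xFFFD));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // HYPOTHETICAL table: invented values, not taken from any real ROS.
        Map<Integer, Integer> toyTable = Map.of(0x0102, 0x65E5, 0x0304, 0x672C);
        byte[] raw = {0x01, 0x02, 0x03, 0x04};

        // With a table, the CIDs resolve to Unicode code points.
        System.out.println(toText(toCids(raw), toyTable));
        // Without a table, the CIDs are known but no text can be derived.
        System.out.println(toText(toCids(raw), Map.of()));
    }
}
```

The second call shows the situation described above: the character codes decode to CIDs without any trouble, but without a ROS-specific table there is nothing to turn them into Unicode.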

To make use of these ROS values during text extraction, iText needs a set of resource files defining the mappings for the different predefined ROS values. As these files are quite large, they are not part of the standard iText main distribution jar/dll but have to be added to the class path as a separate jar/dll file.

I only tested this using the Java version of iText as I am more proficient with it.

iText 5.x/Java

The Maven coordinates for the 5.x version of this jar artifact:
```
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>
```
(As nothing has changed in these resources in the course of the recent years, there have been no 5.x releases since 5.2.0.)

After I added that jar to the classpath here, I could successfully extract Asian characters from your PDF. Whether they are 100% correct, I cannot say as I cannot read them.
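
If extraction still returns nothing, it may help to confirm that the resource jar is really visible at runtime. The following is a rough sketch using only the JDK. The class name `ResourceCheck` and the method `isOnClasspath` are hypothetical, and the default resource path is an assumption about where the itext-asian jar keeps its CJK registry, so check it against the actual jar contents (for example with `jar tf`) before relying on it.

```java
import java.net.URL;

/** Checks whether a resource is visible on the classpath, e.g. one shipped in itext-asian. */
final class ResourceCheck {

    /** Returns true if the system class loader can locate the named resource. */
    static boolean isOnClasspath(String resourcePath) {
        URL url = ClassLoader.getSystemResource(resourcePath);
        return url != null;
    }

    public static void main(String[] args) {
        // ASSUMPTION: default path guessed for the itext-asian jar; pass the real one as an argument.
        String assumed = args.length > 0
                ? args[0]
                : "com/itextpdf/text/pdf/fonts/cmaps/cjk_registry.properties";
        System.out.println(assumed + (isOnClasspath(assumed) ? " found" : " NOT found"));
    }
}
```

Run it with the same classpath as the extraction code. A "NOT found" result for a path that does exist inside the jar suggests the jar is not being picked up.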

iTextSharp 5.x/.Net

There should be a similar iTextSharp DLL with Asian font resources. (I found the iText 7 variant thereof but I am not sure that that works with a 5.x iTextSharp.)

Googling around, one finds a number of iTextAsian-*, iTextAsianCmaps-*, and iTextAsian-all-* files, but I don't know which of them work with the current iTextSharp 5.5.12.

As the OP found out, one additionally has to register the DLLs with iTextSharp (in contrast to iText/Java):
Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
```
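// PdfDocument is the name of your own text extraction class.
static PdfDocument()
{
    // Register the Asian resource DLLs once, before any text is extracted.
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}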
```
被撕碎了的回忆

2020-12-22 09:18
I have an addition to the answer given by @mkl. Here is how to notify iTextSharp that the Asian DLLs are in the project. You need to add a static constructor to your text extraction class:
```
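// PdfDocument is the name of your own text extraction class.
static PdfDocument()
{
    // Register the Asian resource DLLs once, before any text is extracted.
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsian.dll");
    iTextSharp.text.io.StreamUtil.AddToResourceSearch("iTextAsianCmaps.dll");
}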
```