Not able to copy special character from pdf

Question

I need to copy some text, including special characters, from a PDF file, but a special character like the dash (-) gets converted into 2.

Please find the attachment at the link below:

http://www.fileconvoy.com/dfl.php?id=g6a3426746a10af3b9992384375c5923396bce3660

The attachment contains the source PDF file from which I have to copy the data, along with a screenshot image. I need urgent help. I have also tried to copy the data from the PDF using Google Docs and Adobe Pro, but I get a similar result every time.

Answer 1:

In a nutshell:

All information in your PDF indicates that the glyphs you see as dashes actually represent a two. Thus, to have those glyphs interpreted differently, you have to either change the value-to-Unicode mapping for that character in its font in your PDF or resort to optical character recognition.

In detail:

Let's look at the part of the content stream of your PDF pg_0001.pdf from which the words you marked are created:

0 -1.1065 TD
[(Fibroblast)-241.2(growth)-234.1(factor-21)-237.3(\(FGF-21\))-242.3(activity)-233.9(in)-237(High-fat)-237.9(diet)-234.9(\(HFD\))-238.3(fed)-234(ApoE)]TJ
/F6 1 Tf
6.7246 0 0 5.9768 357.3354 542.4944 Tm
(2)Tj
/F4 1 Tf
.8346 0 TD
(/)Tj
/F6 1 Tf
.3372 0 TD
(2)Tj
/F4 1 Tf
8.9663 0 0 8.9663 372.9826 538.5259 Tm
[(mice)-235.6(with)-233.5(adiponectin)-240.8(\(Acrp30\))-237.6(knockdown.)]TJ

Each of your special characters is indeed represented here by the character '2' (= 50 = 0x32) from the font /F6.

The mapping from a character in the string to the glyph actually printed may be quite arbitrary, though, and the font may contain hints for the correct interpretation. So we should look at the definition of that font /F6 on that page:

<<
  /FirstChar 44
  /ToUnicode 21 0 R
  /Encoding 22 0 R
  /FontDescriptor 23 0 R
  /BaseFont /KAHBDA+AdvP7DA6
  /Subtype /Type1
  /LastChar 50
  /Type /Font
  /Widths [833 0 0 0 0 0 833]
>>

So your font comes with a /ToUnicode mapping, which text extraction programs should use to interpret the characters in the content stream. Let's look at that mapping:

/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (F6+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /F6+0 def
/CMapType 2 def
1 begincodespacerange <2c> <32> endcodespacerange
2 beginbfchar
<2c> <002C>
<32> <0032>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

Thus the '2' = 0x32 here is mapped to <0032>, which represents the Unicode code point 0x0032, which once again is '2'.

If the /ToUnicode mapping were not present, a text extraction program could instead use the /Encoding definition in the PDF object 22 0. But here again:
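The lookup a text extractor performs on such a mapping can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles bfchar entries only (no bfrange), and the function name parse_bfchar is made up for this example.

```python
import re

# The relevant part of the /ToUnicode CMap quoted above.
CMAP_TEXT = """
1 begincodespacerange <2c> <32> endcodespacerange
2 beginbfchar
<2c> <002C>
<32> <0032>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry.

    Hypothetical helper for illustration; bfrange sections are ignored.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is UTF-16BE encoded, written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

to_unicode = parse_bfchar(CMAP_TEXT)
print(to_unicode[0x32])  # the character an extractor reports for code 0x32
```

With the mapping from this PDF, code 0x32 resolves to the digit two, which is exactly what copy and paste produces.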

22 0 obj 
<<
  /Type /Encoding
  /Differences [44 /comma 50 /two]
>>

Here the '2' = 50 is mapped to the glyph named /two, which once again makes that glyph a two.

Thus, all information in your PDF short of the glyph drawing definition itself (which could theoretically be checked by OCR) indicates that the dash glyph is indeed a two.
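How a /Differences array is read can be sketched as follows: an integer sets the current character code, and each glyph name after it is assigned to consecutive codes. The array is written here as a plain Python list, and parse_differences is a name chosen for this example only.

```python
def parse_differences(tokens):
    """Expand a /Differences array into {character code: glyph name}.

    Hypothetical helper: integers reset the current code, and every
    following name takes the next consecutive code.
    """
    mapping = {}
    code = None
    for tok in tokens:
        if isinstance(tok, int):
            code = tok
        else:
            if code is None:
                raise ValueError("glyph name before any character code")
            mapping[code] = tok
            code += 1
    return mapping

# /Differences [44 /comma 50 /two] from object 22 0
glyph_names = parse_differences([44, "comma", 50, "two"])
print(glyph_names[ord("2")])  # glyph name for the character '2' (= 50)
```

So even without /ToUnicode, an extractor falling back to the glyph name would still report a two.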

To make a text extraction program interpret that glyph more to your liking, you would have to change the /ToUnicode mapping of <32> to e.g. <002D>. Unfortunately, the stream holding that mapping is compressed (with the filter /FlateDecode), so this is not a simple hex editor job; the stream has to be decoded, changed, and encoded again.

Source: https://stackoverflow.com/questions/15264616/not-able-to-copy-special-character-from-pdf

Tags
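The decode, patch, and re-encode step could look roughly like this in Python, since /FlateDecode data is zlib/deflate data. This is a sketch under assumptions, not a complete tool: it assumes you have already cut the raw stream bytes of object 21 0 out of the file, patch_tounicode is a made-up name, and the demo compresses the CMap text itself to stand in for those bytes. Writing the result back also means updating the stream's /Length entry and, if the size changes, the offsets in the cross-reference table.

```python
import re
import zlib

def patch_tounicode(stream_bytes, code_hex, new_unicode_hex):
    """Decompress a CMap stream, remap one bfchar entry, and recompress it.

    Hypothetical helper for illustration; only bfchar sections are touched.
    """
    cmap = zlib.decompress(stream_bytes).decode("latin-1")
    entry = re.compile(rf"<{code_hex}>(\s*)<[0-9A-Fa-f]+>", re.IGNORECASE)
    replacement = rf"<{code_hex}>\g<1><{new_unicode_hex}>"

    def patch_section(match):
        # Rewrite entries only between beginbfchar and endbfchar, so the
        # codespacerange line is left alone.
        return entry.sub(replacement, match.group(0))

    patched = re.sub(r"beginbfchar.*?endbfchar", patch_section, cmap, flags=re.S)
    if patched == cmap:
        raise ValueError(f"no bfchar entry changed for code <{code_hex}>")
    return zlib.compress(patched.encode("latin-1"))

# Stand-in for the compressed stream of object 21 0 (assumption for the demo).
CMAP = (
    "1 begincodespacerange <2c> <32> endcodespacerange\n"
    "2 beginbfchar\n<2c> <002C>\n<32> <0032>\nendbfchar\n"
)
original = zlib.compress(CMAP.encode("latin-1"))

patched = patch_tounicode(original, "32", "002D")
patched_text = zlib.decompress(patched).decode("latin-1")
print(patched_text)
```

U+002D is the ASCII hyphen-minus; if the glyph is meant as a minus sign or an en dash, a different code point may be the better target.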

pdf

Ubuntu

text

copy