How to cut-paste from PDF with non-ASCII encoding?

Question

I have some PDFs and I am trying to copy and paste text from them in Acrobat Reader into an HTML form. I suspect some of these files use Unicode for text encoding, because when I paste into the HTML form (in Firefox) I get little boxes with hex characters in them rather than readable text. The problem is not that the PDF has not been OCRed: when I try to run OCR in Acrobat Pro, it says it can't because the file already contains renderable text. Is there any way to deal with this? For example, could I add some JavaScript to the form that would do the conversion?

Answer 1:

Are you able to paste text copied from the file into other programs, such as Notepad or Word?

Some PDF files, even ones produced by Adobe tools, lack information that is crucial for extracting text from them. Specifically, such files do not contain glyph-to-character mapping information.

Such files will be displayed and printed just fine, but text from them can't be properly copied / extracted.

For example, Distiller produces such files when the "Smallest File Size" preset is used.

Answer 2:

I have the same problem. It is explained here: http://forums.adobe.com/thread/915012

My solution was to convert the PDF to Word using Acrobat's export tool and then extract the information I needed from the Word file.

It's frustrating, but it works.

Another solution I found is to convert the PDF to images (JPEG, PNG, etc.) and then run OCR on them.

Answer 3:

It is quite possible that the text contains characters that get copied correctly but that your browser is unable to display because it lacks a suitable font. A PDF document may contain embedded fonts, so Adobe Reader displays the characters correctly, but a browser has no access to those fonts.

You can check whether this is the reason by trying to copy and paste the characters here (it might be useful info about the problem anyway). You could also download and install the Code200x fonts, which contain pretty much any character you can normally expect to encounter. (It is not guaranteed, but probable, that Firefox will be able to use those fonts automatically when needed.)
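
Another way to run this check is to look at the code points of the pasted text directly. The following is a minimal Python sketch, not a definitive diagnostic: the function names `classify_char`, `inspect_pasted_text`, and `looks_like_missing_mapping` are made up for illustration, and the interpretation is an assumption. If every character has an ordinary Unicode name, the copy probably worked and the boxes point to a missing font. If the text consists of private-use, unassigned, or replacement characters, the PDF more likely lacks usable glyph-to-character mapping.

```python
import unicodedata


def classify_char(ch):
    """Return a rough label for a single character."""
    if ch == "\ufffd":
        return "replacement"      # U+FFFD REPLACEMENT CHARACTER
    category = unicodedata.category(ch)
    if category == "Co":
        return "private-use"      # e.g. U+E000..U+F8FF
    if category == "Cn":
        return "unassigned"       # no character defined at this code point
    if category == "Cc" and ch not in "\t\n\r":
        return "control"
    return "ok"


def inspect_pasted_text(text):
    """List (code point, Unicode name, label) for each distinct
    character that is non-ASCII or otherwise suspicious."""
    rows = []
    seen = set()
    for ch in text:
        if ch in seen:
            continue
        seen.add(ch)
        label = classify_char(ch)
        if ord(ch) < 128 and label == "ok":
            continue  # plain ASCII is not interesting here
        name = unicodedata.name(ch, "<no name>")
        rows.append((f"U+{ord(ch):04X}", name, label))
    return rows


def looks_like_missing_mapping(text):
    """True if any character is private-use, unassigned, a stray
    control character, or U+FFFD (heuristic only)."""
    return any(label != "ok" for _, _, label in inspect_pasted_text(text))


if __name__ == "__main__":
    pasted = input("Paste the copied text: ")
    for code_point, name, label in inspect_pasted_text(pasted):
        print(code_point, name, label)
```

Paste a short sample copied from the PDF when prompted. Rows labelled `ok` with sensible names (for example Cyrillic letters) suggest installing a font that covers them; rows labelled `private-use` or `replacement` suggest that exporting the document or running OCR is the more promising route.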

Answer 4:

Select the text in Acrobat.
Right-click and select "Copy with formatting" from the context menu.
Wait for the progress bar to process the text.
Paste into the Word document.

Answer 5:

We had a similar problem trying to copy and paste Cyrillic text from a PDF file into Excel.

The easiest solution we found was to open the PDF in a browser (Chrome, Mozilla or Opera) and copy and paste the text into Word or Excel from there.

It didn't work with IE, as expected.

Answer 6:

I had the same problem, but I solved it by opening the PDF file in a web browser (Chrome in my case). Copying and pasting non-ASCII text works fine in Chrome.

Answer 7:

You can export from Acrobat as JPEG, then open the JPEG in Acrobat (not Reader) and run the OCR tool. From there you should be able to copy and paste.

Source: https://stackoverflow.com/questions/9143154/how-to-cut-paste-from-pdf-with-non-ascii-encoding

Tags

pdf

unicode

acrobat