Copy+pasting text from PDF results in garbage

无人及你 2021-02-20 00:37

I am writing a Master's thesis, an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. There are a few PDF files that can not be

7 answers

没有蜡笔的小新 (OP)

2021-02-20 01:24

If you are able to successfully select and copy the text in Adobe Reader (which indicates that the PDF does contain text objects), but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.

The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.
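To make the mapping idea concrete, here is a minimal Python sketch of how an extractor might use a font's ToUnicode CMap (the CMap that maps character codes to Unicode for extraction). It is a simplification, not how PDFBox or Adobe Reader actually do it: the CMap fragment is made up for illustration, only `beginbfchar` sections are handled (`bfrange` is ignored), and character codes are assumed to be one byte wide. The names `SAMPLE_CMAP`, `parse_bfchar` and `decode` are hypothetical.

```python
import re

# Hypothetical ToUnicode CMap fragment. In a real PDF this stream is
# attached to the font; the two entries below are invented for the example.
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is a UTF-16BE encoded string written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode(codes, mapping):
    """Map each one-byte character code to text (assumption: 1-byte codes).

    Codes with no entry become U+FFFD, which is roughly what shows up as
    garbage when the CMap is missing, corrupt, or not understood.
    """
    return "".join(mapping.get(code, "\ufffd") for code in codes)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode(b"\x01\x02", mapping))  # prints "Hi"
print(decode(b"\x01\x03", mapping))  # "H" followed by the replacement character
```

The glyphs on screen can still look perfect in both cases, because rendering uses the font's glyph data while copy and paste depends on this code-to-Unicode table.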

My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
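If you do try several SDKs, it may help to score their output automatically instead of eyeballing it. The sketch below is only a rough heuristic of my own, not part of any PDF library: it counts the share of characters that are replacement characters, private-use code points, unassigned code points, or control characters. The function name `suspicious_ratio` is hypothetical, and what counts as "too high" is something you would have to tune on your own files.

```python
import unicodedata

def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bad = 0
    for ch in chars:
        # Co = private use, Cn = unassigned, Cc = control character;
        # U+FFFD is the Unicode replacement character.
        if ch == "\ufffd" or unicodedata.category(ch) in {"Co", "Cn", "Cc"}:
            bad += 1
    return bad / len(chars)

# Usage: run each SDK's extracted text through the same check.
print(suspicious_ratio("plain text from a healthy PDF"))  # 0.0
print(suspicious_ratio("\ue001\ue002ab"))                 # 0.5
```

Note the limitation: a broken CMap can also produce perfectly valid letters that are simply the wrong ones, and this check will not catch that. A dictionary or language-model check would be needed for those cases.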
