Copy+pasting text from PDF results in garbage

后端 未结 7 2316
无人及你
无人及你 2021-02-20 00:37

I am writing a Master\'s thesis - NLP system. I have one component - extractor.

It is extracting a plain text from PDF files. There are a few PDF files that can not be

7条回答
  •  心在旅途
    2021-02-20 01:25

    The best way to deal with this is (assuming you have Adobe Acrobat, or something similar, not sure if Reader can do this) is save the doc as a JPEG. Then recompile all the images as a single pdf, then use the OCR function to find text in the pages, then you can copy and paste the text.

提交回复
热议问题