Remove all text from PDF file

北城余情 submitted on 2019-11-30 05:20:46

Since my previous answer, development has continued, and a new option is available now, which justifies a new answer.

The most recent versions of Ghostscript support three new parameters (-dFILTERTEXT, -dFILTERIMAGE and -dFILTERVECTOR), which let you remove all TEXT, all IMAGE or all VECTOR elements from a PDF.

To remove all TEXT elements from an input PDF, run

gs -o no-more-texts.pdf -sDEVICE=pdfwrite -dFILTERTEXT   input.pdf

To remove all raster IMAGE elements from an input PDF, run

gs -o no-more-images.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  input.pdf

To remove all VECTOR elements from an input PDF, run

gs -o no-more-vectors.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf

Of course, you can also combine any two of the above parameters (combining all three will create empty pages).

Here are screenshots of a PDF page: the original contains all three element types, and each filtered result looks different.


Screenshot of original PDF page containing "image", "vector" and "text" elements.


Running the following 6 commands will create all 6 possible variations of remaining contents:

 gs -o noIMG.pdf   -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
 gs -o noTXT.pdf   -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
 gs -o noVCT.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf

 gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
 gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
 gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf

The following image illustrates the results:


Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
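
The three "only..." commands above follow one pattern: to keep a single kind of element, filter out the other two. A minimal sketch of that pattern as shell functions follows. The function names `filters_to_keep_only` and `keep_only` are made up for illustration and are not part of Ghostscript; only the three -dFILTER* switches come from the commands shown above.

```shell
# Hypothetical helper: print the -dFILTER* switches that leave only one
# kind of element (text, image or vector) on the page.
filters_to_keep_only() {
    case "$1" in
        image)  echo "-dFILTERVECTOR -dFILTERTEXT" ;;
        text)   echo "-dFILTERVECTOR -dFILTERIMAGE" ;;
        vector) echo "-dFILTERIMAGE -dFILTERTEXT" ;;
        *)      echo "unknown kind: $1 (expected image, text or vector)" >&2
                return 1 ;;
    esac
}

# Hypothetical wrapper: keep_only KIND input.pdf output.pdf
keep_only() {
    flags=$(filters_to_keep_only "$1") || return 1
    # $flags is left unquoted on purpose so it splits into two switches.
    gs -o "$3" -sDEVICE=pdfwrite $flags "$2"
}
```

For example, `keep_only text input.pdf onlyTXT.pdf` would run the same command as the onlyTXT.pdf line above.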


You can achieve what you want without Ghostscript, simply by using a text editor.

  1. Convert your compressed PDF into one which has (nearly) all PDF objects' contents and streams expanded into a readable form using QPDF:

    qpdf --qdf --object-streams=disable input.pdf editable.pdf
    
  2. Open your new editable.pdf file with a text editor (which also gracefully handles any remaining binary blobs inside the PDF such as font or ICC resources).

  3. Search for all occurrences of the TJ and Tj strings (PDF operators used to show text) inside PDF object streams and change them to JT and jT respectively (undefined, nonsense PDF operators). Save the file as edited.pdf.

  4. Now convert your edited.pdf to your PNG images as needed.

Note, the edited.pdf will still display in most PDF viewers, but the text will be missing. However, it will be easy to restore the text again, by restoring the original TJ/Tj operators.
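
If you would rather script step 3 than do it by hand, something like the following sed sketch may work. It rests on assumptions that are not guaranteed for every file: that each TJ or Tj operator sits at the end of its own line (typical for qpdf's QDF output), that lines end with a plain newline, and that your sed copes with the binary blobs left in the file (hence LC_ALL=C). The function names are made up for illustration.

```shell
# Hypothetical helper: read a QDF-style PDF on stdin, write it to stdout
# with the text-showing operators swapped for the nonsense names JT / jT.
# TJ follows an array (closing "]"); Tj follows a string (")" or ">").
hide_text_ops() {
    LC_ALL=C sed -e 's/\(\][[:space:]]*\)TJ$/\1JT/' \
                 -e 's/\([)>][[:space:]]*\)Tj$/\1jT/'
}

# Hypothetical helper: undo the swap, restoring the original operators.
restore_text_ops() {
    LC_ALL=C sed -e 's/\(\][[:space:]]*\)JT$/\1TJ/' \
                 -e 's/\([)>][[:space:]]*\)jT$/\1Tj/'
}

# Usage, with the file names from the steps above:
#   hide_text_ops < editable.pdf > edited.pdf
#   restore_text_ops < edited.pdf > restored.pdf
```

Because each replacement has the same length as the original operator, stream lengths and byte offsets in the file stay the same, so no repair pass should be needed afterwards. Check the result in a viewer before relying on it, since a pattern this simple could in principle also hit a matching line inside binary data.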


Update/Correction

My bad! My original answer contained a repeated typo. I had used tj at places where Tj should have been used. Sorry for any confusion that may have created.

Update 2

To clarify what an "object stream" is... In the "normalized" form created by the qpdf command given above, objects with streams usually look like this (where NNN is an integer number):

NNN 0 obj
<<
   % Here are the key:value pairs of the object dictionary
   /Key1 somevalue1
   /Key2 somevalue2
   % ... (more key:value pairs)
>>
stream
% Here is the content of the object stream
endstream
endobj

An "image stream" has basically the same structure. But the key:value pairs typically contain the following 4 entries, in any order (where NNN and MMM are integer values giving width and height of the image in pixels):

/Type /XObject
/Subtype /Image
/Width NNN
/Height MMM
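
To see which objects in the normalized file are image streams (and should therefore be left alone when editing), a rough awk sketch like the one below can list their object numbers. It assumes the QDF layout shown above, with "NNN 0 obj" on a line of its own, and the function name is made up for illustration.

```shell
# Hypothetical helper: read a QDF-style PDF on stdin and print the
# "object generation" numbers of every object marked /Subtype /Image.
list_image_objects() {
    LC_ALL=C awk '
        /^[0-9]+ [0-9]+ obj$/ { current = $1 " " $2 }   # remember the object header
        /\/Subtype \/Image/   { if (current != "") print current }
        /^endobj$/            { current = "" }          # object finished
    '
}

# Usage, with the file name from the steps above:
#   list_image_objects < editable.pdf
```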

Obviously this is not a standard requirement, but it was recently discussed on the #Ghostscript forum on IRC. The channel is logged and you can find the discussion here:

http://ghostscript.com/irclogs/2014/05/21.html

We originally suggested changing the initial text rendering mode to 3 in pdf_ops.ps, but that had no effect on the file as it was using a type 3 font. So we suggested instead altering the definitions of TJ and Tj in the same file. Look at around 15:37 in the log.
