Removing text from PDF

早过忘川 submitted on 2019-11-28 10:43:50

Question


I'm looking for a solution to remove/delete ALL text from a PDF. I've been using iTextSharp for a while now, and extracting text from a PDF with it is easy (without the use of OCR). However, I can't find an option to delete the text.

This solution frankly doesn't work for me.

    page.GetAsArray(PdfName.CONTENTS);

returns null for me, also when using PdfName.Text and some others I've tried.

The library to use doesn't really matter; I just think iTextSharp should be able to do this. However, if there is another (free) solution, bring it on.

EDIT: Just to make clear why I want to remove all text from the pdfs

I want to reduce the size of the PDFs. I do this by reducing the resolution of the images in the PDF. However, in a lot of cases the vector images take up most of the space. So I thought of the following: remove all text, then convert the remaining PDF (with only the images and vectors) to a bitmap (JPEG). After that I paste the text over it again. Another option would be to make the text invisible, but I don't think this is any easier.


Answer 1:


  1. The /Contents of a page dictionary doesn't always consist of an array. It should be evident that GetAsArray() returns null if the content is stored as a stream.
  2. Suppose you use GetAsStream() and you remove all the text contents from the stream, then you may still have text content in XObjects. That text won't be referenced from a content stream, but iText won't be able to remove the XObjects as 'unused objects' because the objects will still be referenced from the /Resources in the page dictionary.

Please read ISO-32000-1 to find out what you're doing wrong.
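The two points above can be illustrated with a minimal Python sketch. It is not iText code. It assumes the page content has already been decoded into bytes, and it models /Contents as either one byte string (a single stream) or a list of byte strings (an array of streams). The names `Contents`, `join_contents` and `strip_text_objects` are hypothetical, and the regular expression is deliberately naive: it assumes `BT` and `ET` never appear inside string literals or inline image data. Text drawn inside form XObjects is not touched, which is exactly the limitation described in point 2.

```python
import re
from typing import List, Optional, Union

# Hypothetical stand-in for a page's /Contents after decoding:
# None (no content), one stream (bytes), or an array of streams (list of bytes).
Contents = Optional[Union[bytes, List[bytes]]]

# Naive pattern: everything from a BT operator to the next ET operator.
# Assumption: "BT"/"ET" never occur inside string literals or inline images.
_TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bET\b", re.DOTALL)


def join_contents(contents: Contents) -> bytes:
    """Return one byte string whether /Contents was a stream or an array."""
    if contents is None:
        return b""
    if isinstance(contents, (bytes, bytearray)):
        return bytes(contents)
    # The streams of an array are treated as one stream, so a text object
    # may start in one part and end in the next; join before filtering.
    return b"\n".join(bytes(part) for part in contents)


def strip_text_objects(contents: Contents) -> bytes:
    """Drop BT ... ET text objects; leave image and vector operators alone."""
    return _TEXT_OBJECT.sub(b"", join_contents(contents))


if __name__ == "__main__":
    single = b"q /Im0 Do Q\nBT /F1 12 Tf (Hello) Tj ET\n0 0 10 10 re f"
    print(strip_text_objects(single))
```

A real implementation would need a proper content stream tokenizer and would also have to walk the XObjects listed in the page's /Resources.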




Answer 2:


Now that you've updated your question, and revealed the motivation of the intended measure, let me tell you the truth:

  • These measures will in no way reduce the size of PDFs.

  • Instead they'll lead to a hugely increased file:

    1. First removing text + fonts may lead to a slight shrinking of the size, yes.

    2. Then converting the remains of the page to a bitmap will certainly increase the size hugely (unless you accept very low image quality).

    3. Finally, 'pasting' the text over it again will increase the file size again (very likely by the same amount you saved in the first step).

It's not a good plan at all.

If you provide (a link to) one of your typical sample PDF files, I can probably come up with a Ghostscript (plus other tools) command line that works out of the box and shrinks the PDF size more efficiently.




Answer 3:


To remove all text in a PDF, the easiest solution is to use Ghostscript:

    gs -o output_no_text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
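
If the command has to be run from a script, a thin wrapper along these lines may help. It is only a sketch: the function names `build_gs_command` and `remove_text` are hypothetical, it assumes a reasonably recent Ghostscript that understands `-dFILTERTEXT`, and it assumes the executable is reachable as `gs` (on Windows it is typically `gswin64c`, which can be passed via the `gs` parameter).

```python
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs: str = "gs") -> List[str]:
    """Build the Ghostscript argument list from the answer above."""
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def remove_text(src: str, dst: str, gs: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(src, dst, gs), check=True)


if __name__ == "__main__":
    # Only prints the command; call remove_text(...) to actually run it.
    print(" ".join(build_gs_command("input.pdf", "output_no_text.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.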


Source: https://stackoverflow.com/questions/12674195/removing-text-from-pdf
