Bad characters when replacing text in a PDF using PDFBox

Question

I'm trying to replace text in a PDF, and the replacement only partly works. This is my code:

    PDDocument doc = null;
    int occurrences = 0;
    try {
        doc = PDDocument.load("test.pdf"); //Input PDF File Name
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    if (op.getOperation().equals("Tj")) {
                        // Tj takes one operand, the string to display,
                        // so update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        if (string.contains("Good")) {
                            string = string.replace("Good", "Bad");
                            occurrences++;
                        }
                        // The check above holds the word to change; it currently replaces "Good" with "Bad"
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    } else if (op.getOperation().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        COSString temp = new COSString();

                        String tempString = "";
                        for (int t = 0; t < previous.size(); t++) {

                            if (previous.get(t) instanceof COSString) {
                                tempString += ((COSString) previous.get(t)).getString();

                            }
                        }

                        temp.append(tempString.getBytes("ISO-8859-1"));
                        tempString = temp.getString();
                        if (tempString.contains("Good")) {
                            tempString = tempString.replace("Good", "Bad");
                            occurrences++;
                        }
                        previous.clear();

                        String[] stringArray = tempString.split(" ");

                        for (String string : stringArray) {
                            COSString cosString = new COSString();
                            string = string + " ";
                            cosString.append(string.getBytes("ISO-8859-1"));
                            previous.add(cosString);
                        }

                    }
                }
            }
            // now that the tokens are updated we will replace the page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }
        System.out.println("number of matches found: " + occurrences);
        doc.save("a.pdf"); //Output file name
    } catch (IOException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } catch (COSVisitorException ex) {
        Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ex) {
                Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }
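
The TJ branch above can be tried out without PDFBox. The following is a minimal sketch in plain Java, not the PDFBox API: the class `TjArraySketch` and its methods `join` and `rebuild` are hypothetical names, and a `List<Object>` of strings and numbers stands in for the `COSArray` operand. It mirrors the concatenate, replace, and split steps of the code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the TJ handling above; no PDFBox types are used.
final class TjArraySketch {

    // Concatenates the string elements of a TJ-style array and skips the
    // numeric kerning adjustments, as the loop over COSString does above.
    static String join(List<Object> operands) {
        StringBuilder sb = new StringBuilder();
        for (Object o : operands) {
            if (o instanceof String) {
                sb.append((String) o);
            }
        }
        return sb.toString();
    }

    // Mirrors the rebuild loop above: split on spaces and append a space
    // to every piece, including the last one.
    static List<String> rebuild(String text) {
        List<String> parts = new ArrayList<>();
        for (String word : text.split(" ")) {
            parts.add(word + " ");
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Object> operands = new ArrayList<>();
        operands.add("Go");
        operands.add(-20);        // kerning adjustment, dropped by join
        operands.add("od day");
        String replaced = join(operands).replace("Good", "Bad");
        System.out.println(rebuild(replaced)); // prints [Bad , day ]
    }
}
```

The sketch makes two side effects of this approach visible: the kerning numbers of the original array are discarded, and the last word gains a trailing space it did not have before.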

The problem is that the replaced text is rendered with bad or missing characters (for example, the word "Bad" shows up as only the character "d"). However, if I copy the text and paste it somewhere else, the expected word is pasted correctly. Also, when I search the generated PDF for the new word, it is not found, but when I search for the old word, it is found at the replaced locations.
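
One possible explanation, which nothing in this thread confirms, is that the PDF embeds only a subset of the font, containing glyphs just for the characters that already appear in the document. Under that assumption, any character of the replacement that was not used before has no glyph to draw. The sketch below is plain Java and does not inspect a real font; `SubsetCoverageSketch` and `missingGlyphs` are hypothetical names, and the `originalText` parameter stands for all text drawn with that font.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a subset font: only characters that occur in the
// original text are assumed to have glyphs. This is an assumption about
// the PDF, not something read from the file.
final class SubsetCoverageSketch {

    // Returns the distinct characters of replacement that never occur in
    // originalText, in order of first appearance.
    static String missingGlyphs(String originalText, String replacement) {
        Set<Character> available = new HashSet<>();
        for (char c : originalText.toCharArray()) {
            available.add(c);
        }
        StringBuilder missing = new StringBuilder();
        for (char c : replacement.toCharArray()) {
            boolean alreadyListed = missing.indexOf(String.valueOf(c)) >= 0;
            if (!available.contains(c) && !alreadyListed) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // "Good" supplies G, o and d, so B and a would have no glyph.
        System.out.println(missingGlyphs("Good", "Bad")); // prints Ba
    }
}
```

If the assumption holds, this would be consistent with only "d" being visible, while copy and paste still yields "Bad" because the character codes themselves were written to the content stream.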

Answer 1:

I found Aspose; this link shows how to use it to replace text in PDFs. It is easy to use and works perfectly, except that it is not free: the free version prints a copyright line at the top of each page of the PDF file. http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document

Source: https://stackoverflow.com/questions/33390219/bad-characters-when-replacing-text-in-pdf-using-pdfbox

Tags

java

pdf

pdfbox