Problems using extended escape mode for jsoup output

Question

I need to transform an HTML file by removing certain tags from it. To do this I have something like the following:

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

import java.io.File;
import java.io.IOException;

public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply a URL or file path");
        String url = args[0];

        Document doc;
        if (url.startsWith("http://") || url.startsWith("https://")) {
            // Remote page: fetch and parse it
            doc = Jsoup.connect(url).get();
        } else {
            // Local file: a null charset lets jsoup detect the encoding
            File f = new File(url);
            doc = Jsoup.parse(f, null);
        }

        /* remove some tags */

        // This is the setting that triggers the problem described below
        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());
    }
}

The issue with the above code is that, when I use the extended escape mode, the values of the HTML tag attributes are entity-encoded in the output. Is there any way to avoid this? Setting the escape mode to base or xhtml doesn't work, because some of the non-standard extended entities (such as &rsquo;, which renders as ’) then cause problems. For example, for the HTML below,

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test&reg;</title>
</head>
<body style="background-color:#EDEDED;">
<P>
   <font style="color:#003698; font-weight:bold;">Testing HTML encoding - &rsquo; &copy; with a <a href="http://www.google.com">link</a>
   </font> 
   <br />
</P>
</body>
</html>

The output I get is,

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>&NewLine;
  <title>Test&reg;</title>&NewLine;
 </head>&NewLine;
 <body style="background-color&colon;&num;EDEDED&semi;">&NewLine;
  <p>&NewLine; <font style="color&colon;&num;003698&semi; font-weight&colon;bold&semi;">Testing HTML encoding - &rsquor; &copy; with a <a href="http&colon;&sol;&sol;www&period;google&period;com">link</a></font> <br />&NewLine;</p>&NewLine;&NewLine;&NewLine;&NewLine;
 </body>
</html>

Is there any way to get around this issue?

Answer 1:

What output character set are you using? (It defaults to the character set of the input, which will vary from site to site if you are loading from URLs.)

You probably want to set it explicitly, either to UTF-8, or to ASCII or some other restricted charset if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default) and the charset to ASCII, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numeric escape.

For example:

String check = "<p>&rsquo; <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default

doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());

doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());

Gives:

UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>
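
To make the charset fallback described above more concrete, here is a minimal sketch of the idea using only the JDK. It is not jsoup's actual implementation: the class NumericEscapeSketch and the method escapeForCharset are hypothetical names made up for this illustration, and jsoup may format its numeric escapes differently depending on the version. The sketch only shows the principle that a character the target charset cannot encode is written as a numeric character reference, while everything else passes through.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class NumericEscapeSketch {
    // Hypothetical helper, not part of jsoup: escapes the markup-significant
    // characters and replaces anything the target charset cannot encode with
    // a decimal numeric character reference.
    static String escapeForCharset(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (codePoint == '&') {
                out.append("&amp;");
            } else if (codePoint == '<') {
                out.append("&lt;");
            } else if (codePoint == '>') {
                out.append("&gt;");
            } else if (encoder.canEncode(ch)) {
                // Representable in the charset: emit the character unchanged
                out.append(ch);
            } else {
                // Not representable: fall back to a numeric escape
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2019 is the right single quote (rsquo), \u00a9 is the copyright sign
        String text = "Testing \u2019 \u00a9";
        System.out.println("UTF-8: " + escapeForCharset(text, "UTF-8"));
        System.out.println("ASCII: " + escapeForCharset(text, "US-ASCII"));
    }
}
```

With US-ASCII as the target, the sample text comes out as Testing &#8217; &#169;, whereas with UTF-8 both characters are left as they are. Note that nothing here touches colons, slashes, or periods, which is why attribute values such as URLs stay readable in this mode.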

Hope this helps!

Source: https://stackoverflow.com/questions/6697619/problems-using-extended-escape-mode-for-jsoup-output

Tags

html-entities

jsoup