ASCII to HTML-Entities Escaping in Java

I found a website with escape codes, and I'm wondering whether someone has already done this so I don't have to spend a couple of hours building this logic:

 StringBuffer sb = new StringBuffer();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     switch (c) {
         case '\u25CF': sb.append("&#9679;"); break;
         case '\u25BA': sb.append("&#9658;"); break;

         /*
         ... the rest of the hex chars literals to HTML entities
         */  

         default:  sb.append(c); break;
     }
 }

These "codes" are merely the decimal representation of the Unicode value of the actual character. It seems to me that something like this would work, unless you want to be very strict about which codes get converted and which don't:

StringBuilder sb = new StringBuilder();
int n = s.length();
for (int i = 0; i < n; i++) {
    char c = s.charAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append((int) c);
        sb.append(';');
    } else {
        sb.append(c);
    }
}
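As a quick check of the claim that the numeric codes are simply decimal Unicode values, here is a minimal sketch. The class name `EntityCodeCheck` and the helper `toNumericEntity` are made up for illustration and are not part of any library.

```java
class EntityCodeCheck {
    // Hypothetical helper: formats one UTF-16 char as a decimal numeric character reference.
    static String toNumericEntity(char c) {
        return "&#" + (int) c + ";";
    }

    public static void main(String[] args) {
        // Hex 25CF is 9679 in decimal, and hex 25BA is 9658.
        System.out.println(toNumericEntity('\u25CF')); // prints &#9679;
        System.out.println(toNumericEntity('\u25BA')); // prints &#9658;
    }
}
```

These match the two hand-written cases in the question's switch statement, which suggests the lookup table is unnecessary.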

The other answers don't work correctly for surrogate pairs, i.e. characters outside the Basic Multilingual Plane that Java stores as two char values, such as the emoji "😀". Here's how to do it in Java 8:

StringBuilder sb = new StringBuilder();
s.codePoints().forEach(codePoint -> {
    if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(codePoint);
        sb.append(';');
    } else {
        sb.appendCodePoint(codePoint);
    }
});

And for older Java:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int c = s.codePointAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(c);
        sb.append(';');
    } else {
        sb.appendCodePoint(c);
    }
    i += Character.charCount(c);
}

A simple way to test if a solution handles surrogate pairs correctly is to use "\uD83D\uDE00" (😀) as the input. If the output is "&#55357;&#56832;", then it's wrong. The correct output is "&#128512;".
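The test described above can be sketched as a small self-contained program. The class and method names (`NumericEntityEscaper`, `escapePerChar`, `escapePerCodePoint`) are assumptions chosen for this example. The two methods wrap the char-based and code-point-based loops from the answers so their output can be compared.

```java
class NumericEntityEscaper {
    // Char-based variant: treats each UTF-16 code unit separately,
    // so a surrogate pair is emitted as two separate references.
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Code-point-based variant: reads a full code point at a time,
    // so a surrogate pair becomes one reference.
    static String escapePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";
        System.out.println(escapePerChar(emoji));      // prints &#55357;&#56832;
        System.out.println(escapePerCodePoint(emoji)); // prints &#128512;
    }
}
```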

Hmm, what if you replaced the switch inside your loop with something like this instead:

if (c > 127) {
    sb.append("&#" + (int) c + ";");
} else {
    sb.append(c);
}

Then you just need to decide which range of characters you want HTML-escaped. In this case I escape every character outside the 7-bit ASCII range, i.e. any code above 127.
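One way to keep the choice of range separate from the loop is to pass it in as a predicate. The following is only a sketch under that assumption: `RangeEscaper` and its `shouldEscape` parameter are hypothetical names, and it iterates over code points (Java 8 or later) rather than chars so that surrogate pairs are also covered.

```java
import java.util.function.IntPredicate;

class RangeEscaper {
    // Escapes every code point for which shouldEscape returns true
    // as a decimal numeric character reference; copies the rest unchanged.
    static String escape(String s, IntPredicate shouldEscape) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (shouldEscape.test(cp)) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same rule as above: anything beyond 7-bit ASCII.
        IntPredicate nonAscii = cp -> cp > 127;
        System.out.println(escape("caf\u00E9 \u25BA", nonAscii)); // prints caf&#233; &#9658;

        // A stricter rule (an assumption, not from the answer) that also escapes markup characters.
        IntPredicate nonAsciiOrMarkup = cp -> cp > 127 || cp == '<' || cp == '>' || cp == '&';
        System.out.println(escape("a<b", nonAsciiOrMarkup)); // prints a&#60;b
    }
}
```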

Source: https://stackoverflow.com/questions/5440768/ascii-to-html-entities-escaping-in-java

Tags: java, escaping, ascii, html-entities