removing invalid XML characters from a string in java

问题

Hi i would like to remove all invalid XML characters from a string. i would like to use a regular expression with the string.replace method.

line.replace(regExp,"");

what is the right regExp to use ?

invalid XML character is everything that is not this :

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

thanks.

回答1:

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(pattern, "");

回答2:

Should we consider surrogate characters? otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true.

Also tested that the regex way seems slower than the following loop.

if (null == text || text.isEmpty()) {
    return text;
}
final int len = text.length();
char current = 0;
int codePoint = 0;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len; i++) {
    current = text.charAt(i);
    boolean surrogate = false;
    if (Character.isHighSurrogate(current)
            && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
        surrogate = true;
        codePoint = text.codePointAt(i++);
    } else {
        codePoint = current;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.append(current);
        if (surrogate) {
            sb.append(text.charAt(i));
        }
    }
}

回答3:

Jun's solution, simplified. Using StringBuffer#appendCodePoint(int), I need no char current or String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.

(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)

StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}

回答4:

All these answers so far only replace the characters themselves. But sometimes an XML document will have invalid XML entity sequences resulting in errors. For example, if you have  in your xml, a java xml parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple java program that can replace those invalid entity sequences.

  public final Pattern XML_ENTITY_PATTERN = Pattern.compile("\\&\\#(?:x([0-9a-fA-F]+)|([0-9]+))\\;");

  /**
   * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
   */
  String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String group = m.group(1);
      int val;
      if (group != null) {
        val = Integer.parseInt(group, 16);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#x" + group + ";");
        }
      } else if ((group = m.group(2)) != null) {
        val = Integer.parseInt(group);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#" + group + ";");
        }
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      cleanedXmlString = cleanedXmlString.replaceAll(replacer, "");
    }
    return cleanedXmlString;
  }

  private boolean isInvalidXmlChar(int val) {
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0x10000 && val <= 0x10FFFF) {
      return false;
    }
    return true;
  }

回答5:

From Mark McLaren's Weblog

  /**
   * This method ensures that the output String has only
   * valid XML unicode characters as specified by the
   * XML 1.0 standard. For reference, please see
   * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
   * standard</a>. This method will return an empty
   * String if the input is null or empty.
   *
   * @param in The String whose non-valid characters we want to remove.
   * @return The in String, stripped of non-valid characters.
   */
  public static String stripNonValidXMLCharacters(String in) {
      StringBuffer out = new StringBuffer(); // Used to hold the output.
      char current; // Used to reference the current character.

      if (in == null || ("".equals(in))) return ""; // vacancy test.
      for (int i = 0; i < in.length(); i++) {
          current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
          if ((current == 0x9) ||
              (current == 0xA) ||
              (current == 0xD) ||
              ((current >= 0x20) && (current <= 0xD7FF)) ||
              ((current >= 0xE000) && (current <= 0xFFFD)) ||
              ((current >= 0x10000) && (current <= 0x10FFFF)))
              out.append(current);
      }
      return out.toString();
  }

回答6:

String xmlData = xmlData.codePoints().filter(c -> isValidXMLChar(c)).collect(StringBuilder::new,
                StringBuilder::appendCodePoint, StringBuilder::append).toString();

private boolean isValidXMLChar(int c) {
    if((c == 0x9) ||
       (c == 0xA) ||
       (c == 0xD) ||
       ((c >= 0x20) && (c <= 0xD7FF)) ||
       ((c >= 0xE000) && (c <= 0xFFFD)) ||
       ((c >= 0x10000) && (c <= 0x10FFFF)))
    {
        return true;
    }
    return false;
}

回答7:

From Best way to encode text data for XML in Java?

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}

回答8:

If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing - which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.

Web Page: HLL XPL