Control code 0x6 causing XML error

问题

I have a Java application running which fetches data by XML, but once in a while i have some data consisting some sort of control code?

An invalid XML character (Unicode: 0x6) was found in the CDATA section.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x6) was found in     the CDATA section.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at domain.Main.processLogFromUrl(Main.java:342)
    at domain.Main.<init>(Main.java:67)
    at domain.Main.main(Main.java:577)

Can anyone explain what this control code exactly does as i cannot find much info?

Thanks in advance.

回答1:

You need to write a FilterInputStream to filter the data before the SAX parser gets it. It must either remove or recode the bad data.

Apache have a super-flexible example. You may wish to put together a much simpler one.

Here's one of mine which does other cleaning up but I am sure it will be a good start.

/* Cleans up often very bad xml. 
 * 
 * 1. Strips leading white space.
 * 2. Recodes &pound; etc to &#...;.
 * 3. Recodes lone & as &amp.
 * 
 */
public class XMLInputStream extends FilterInputStream {

  private static final int MIN_LENGTH = 2;
  // Everything we've read.
  StringBuilder red = new StringBuilder();
  // Data I have pushed back.
  StringBuilder pushBack = new StringBuilder();
  // How much we've given them.
  int given = 0;
  // How much we've read.
  int pulled = 0;

  public XMLInputStream(InputStream in) {
    super(in);
  }

  public int length() {
    // NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"

    try {
      StringBuilder s = read(MIN_LENGTH);
      pushBack.append(s);
      return s.length();
    } catch (IOException ex) {
      log.warning("Oops ", ex);
    }
    return 0;
  }

  private StringBuilder read(int n) throws IOException {
    // Input stream finished?
    boolean eof = false;
    // Read that many.
    StringBuilder s = new StringBuilder(n);
    while (s.length() < n && !eof) {
      // Always get from the pushBack buffer.
      if (pushBack.length() == 0) {
        // Read something from the stream into pushBack.
        eof = readIntoPushBack();
      }

      // Pushback only contains deliverable codes.
      if (pushBack.length() > 0) {
        // Grab one character
        s.append(pushBack.charAt(0));
        // Remove it from pushBack
        pushBack.deleteCharAt(0);
      }

    }
    return s;
  }

  // Returns false at eof.
  // Might not actually push back anything but usually will.
  private boolean readIntoPushBack() throws IOException {
    // File finished?
    boolean eof = false;
    // Next char.
    int ch = in.read();
    if (ch >= 0) {
      // Discard whitespace at start?
      if (!(pulled == 0 && isWhiteSpace(ch))) {
        // Good code.
        pulled += 1;
        // Parse out the &stuff;
        if (ch == '&') {
          // Process the &
          readAmpersand();
        } else {
          // Not an '&', just append.
          pushBack.append((char) ch);
        }
      }
    } else {
      // Hit end of file.
      eof = true;
    }
    return eof;
  }

  // Deal with an ampersand in the stream.
  private void readAmpersand() throws IOException {
    // Read the whole word, up to and including the ;
    StringBuilder reference = new StringBuilder();
    int ch;
    // Should end in a ';'
    for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
      reference.append((char) ch);
    }
    // Did we tidily finish?
    if (ch == ';') {
      // Yes! Translate it into a &#nnn; code.
      String code = XML.hash(reference);
      if (code != null) {
        // Keep it.
        pushBack.append(code);
      } else {
        throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
      }
    } else {
      // Did not terminate properly! 
      // Perhaps an & on its own or a malformed reference.
      // Either way, escape the &
      pushBack.append("&amp;").append(reference).append((char) ch);
    }
  }

  private void given(CharSequence s, int wanted, int got) {
    // Keep track of what we've given them.
    red.append(s);
    given += got;
    log.finer("Given: [" + wanted + "," + got + "]-" + s);
  }

  @Override
  public int read() throws IOException {
    StringBuilder s = read(1);
    given(s, 1, 1);
    return s.length() > 0 ? s.charAt(0) : -1;
  }

  @Override
  public int read(byte[] data, int offset, int length) throws IOException {
    int n = 0;
    StringBuilder s = read(length);
    for (int i = 0; i < Math.min(length, s.length()); i++) {
      data[offset + i] = (byte) s.charAt(i);
      n += 1;
    }
    given(s, length, n);
    return n > 0 ? n : -1;
  }

  @Override
  public String toString() {
    String s = red.toString();
    String h = "";
    // Hex dump the small ones.
    if (s.length() < 8) {
      Separator sep = new Separator(" ");
      for (int i = 0; i < s.length(); i++) {
        h += sep.sep() + Integer.toHexString(s.charAt(i));
      }
    }
    return "[" + given + "]-\"" + s + "\"" + (h.length() > 0 ? " (" + h + ")" : "");
  }

  private boolean isWhiteSpace(int ch) {
    switch (ch) {
      case ' ':
      case '\r':
      case '\n':
      case '\t':
        return true;
    }
    return false;
  }

  private boolean isAlphaNumeric(int ch) {
    return ('a' <= ch && ch <= 'z') 
        || ('A' <= ch && ch <= 'Z') 
        || ('0' <= ch && ch <= '9');
  }
}

回答2:

Quite why you've got that character will depend on what the data is meant to represent. (Apparently it's ACK, but that's odd to represent in a file...) However, the important point is that it makes the XML invalid - you simply can't represent that character in XML.

From the XML 1.0 spec, section 2.2:

Character Range

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char       ::=   #x9 | #xA | #xD | [#x20-#xD7FF] 
                     | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Note how this excludes Unicode values below U+0020 other than U+0009 (tab), U+000A (line-feed) and U+000D (carriage return).

If you have any influence over the data coming back, you should change it to return valid XML. If not, you'll have to do some preprocessing on it before parsing it as XML. Quite what you'll want to do with unwanted control characters depends on what meaning they have in your situation.

回答3:

Try to define your XML as version 1.1:

<?xml version="1.1"?>

来源：https://stackoverflow.com/questions/8266531/control-code-0x6-causing-xml-error

标签

java

xml

unicode

saxparser