Unicode Regex; Invalid XML characters

无人共我 2020-11-29 20:23

The list of valid XML characters is well known; as defined by the spec, it is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
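As a rough illustration, the ranges above can be written as a plain code-point test. This is only a sketch: the class name `XmlCharCheck` and the method `IsValidXmlCodePoint` are hypothetical names chosen for this example, not part of any library.

```csharp
using System;

public static class XmlCharCheck
{
    // Hypothetical helper: returns true when the code point falls in one of
    // the ranges of the Char production quoted above.
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Note that the surrogate block (#xD800 to #xDFFF) and the two code points #xFFFE and #xFFFF fall between the listed ranges, so they are rejected.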
6 answers
  •  粉色の甜心
    2020-11-29 20:51

    I know this isn't exactly an answer to your question, but it's helpful to have it here:

    Regular Expression to match valid XML Characters:

    [\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
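A minimal sketch of how that character class might be used to validate a whole string in .NET, assuming the pattern is anchored with `\A` and `\z`. The names `ValidXmlDemo` and `IsAllValidBmp` are hypothetical. Because .NET strings are UTF-16 and the class excludes the surrogate block, this check also rejects characters above U+FFFF, which XML itself allows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValidXmlDemo
{
    // The character class from the answer, anchored so the whole string must match.
    private static readonly Regex AllValidBmp = new Regex(
        @"\A[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]*\z");

    // Hypothetical helper: true when every UTF-16 code unit is a valid
    // XML character from the Basic Multilingual Plane.
    public static bool IsAllValidBmp(string s)
    {
        return AllValidBmp.IsMatch(s);
    }
}
```

For example, `IsAllValidBmp("abc\t")` is true, while a string containing U+0000 or a surrogate pair such as `"\U0001F600"` is reported as invalid.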
    

    So to remove invalid chars from XML, you'd do something like

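    // requires: using System.Text.RegularExpressions;

    // filters control characters but allows only properly-formed surrogate sequences
    // NOTE: the original pattern was cut off in this copy; the pattern below is a
    // reconstruction (an assumption) based on the comment above and the spec ranges.
    private static readonly Regex _invalidXMLChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    /// <summary>
    /// removes any unusual unicode characters that can't be encoded into XML
    /// </summary>
    public static string RemoveInvalidXMLChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        return _invalidXMLChars.Replace(text, "");
    }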
    

    I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.
