Unicode Regex; Invalid XML characters

前端 未结 6 727
无人共我
无人共我 2020-11-29 20:23

The list of valid XML characters is well known, as defined by the spec it\'s:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
<         


        
6条回答
  •  温柔的废话
    2020-11-29 21:03

    Another way to remove incorrect XML chars in C# with using XmlConvert.IsXmlChar Method (Available since .NET Framework 4.0)

    public static string RemoveInvalidXmlChars(string content)
    {
       return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
    }
    

    or you may check that all characters are XML-valid.

    public static bool CheckValidXmlChars(string content)
    {
       return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
    }
    

    .Net Fiddle - https://dotnetfiddle.net/v1TNus

    For example, the vertical tab symbol (\v) is not valid for XML, it is valid UTF-8, but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.

提交回复
热议问题