What is the fastest way to programmatically check the well-formedness of XML files in C#?

扶醉桌前 submitted on 2019-12-03 15:00:31

I would expect that XmlReader with while (reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... how are you reading the input?

Do you perhaps have some external (schema etc) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...
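A minimal sketch of such a resolver, assuming the cached DTDs/schemas sit in one local folder and can be matched by file name alone. The class name LocalCacheResolver and the folder layout are made up for illustration; anything not found in the cache falls back to the normal XmlUrlResolver behaviour.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver: serves external entities (DTDs, schemas) from a
// local folder when a file with the same name exists there.
public sealed class LocalCacheResolver : XmlUrlResolver
{
    private readonly string cacheDir;

    public LocalCacheResolver(string cacheDir)
    {
        this.cacheDir = cacheDir;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Assumption: the last path segment of the URI identifies the cached file.
        string fileName = Path.GetFileName(absoluteUri.LocalPath);
        string cached = Path.Combine(cacheDir, fileName);

        if (File.Exists(cached))
        {
            return File.OpenRead(cached);
        }

        // Not cached: defer to the default (possibly remote) lookup.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

It would be plugged in with something like settings.XmlResolver = new LocalCacheResolver(cacheFolder) on the XmlReaderSettings passed to XmlReader.Create, where cacheFolder is whatever directory holds your copies.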

The following does ~300KB virtually instantly:

    using(MemoryStream ms = new MemoryStream()) {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.CloseOutput = false;
        using (XmlWriter writer = XmlWriter.Create(ms, settings))
        {
            writer.WriteStartElement("xml");
            for (int i = 0; i < 15000; i++)
            {
                writer.WriteElementString("value", i.ToString());
            }
            writer.WriteEndElement();
        }
        Console.WriteLine(ms.Length + " bytes");
        ms.Position = 0;
        int nodes = 0;
        Stopwatch watch = Stopwatch.StartNew();
        using (XmlReader reader = XmlReader.Create(ms))
        {
            while (reader.Read()) { nodes++; }
        }
        watch.Stop();
        Console.WriteLine("{0} nodes in {1}ms", nodes,
            watch.ElapsedMilliseconds);
    }
Cerebrus

Create an XmlReader object by passing in an XmlReaderSettings object whose ConformanceLevel is set to ConformanceLevel.Document.

This will validate well-formedness.

This MSDN article should explain the details.
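As a rough sketch of that approach: read the whole document and treat an XmlException as "not well-formed". The class and method names here are made up for illustration, and ignoring DTDs with a null resolver is my own assumption (it keeps the check from fetching anything external); drop those two settings if your documents rely on DTD-defined entities.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedCheck
{
    // Returns true when the input parses as a single well-formed XML document.
    // On failure, 'error' carries the parser's message (including line/position).
    public static bool IsWellFormed(TextReader input, out string error)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Document,
            DtdProcessing = DtdProcessing.Ignore, // assumption: skip any DOCTYPE
            XmlResolver = null                    // assumption: never resolve external resources
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                // Well-formedness errors surface as XmlException during Read().
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

For a file, pass something like new StreamReader(path); for a string, new StringReader(text).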

On my fairly ordinary laptop, reading a 250K XML document from start to finish with an XmlReader takes 6 milliseconds. Something else besides just parsing XML is the culprit.

I know I'm necro-posting, but I think this could be a solution:

  1. Use HTML Tidy to clean up your XML. Set the option to remove the DOCTYPE.
  2. Then read the generated XHTML/XML from Tidy.

Here's some sample code:

    public void GetDocumentStructure(int documentID)
    {
        string scmRepoPath = ConfigurationManager.AppSettings["SCMRepositoryFolder"];
        string docFilePath = scmRepoPath + "\\" + documentID.ToString() + ".xml";

        Tidy tidy = new Tidy();
        tidy.Options.MakeClean = true;
        tidy.Options.NumEntities = true;
        tidy.Options.Xhtml = true;
        // this option removes the DTD on the generated output of Tidy
        tidy.Options.DocType = DocType.Omit;

        MemoryStream output = new MemoryStream();
        TidyMessageCollection msgs = new TidyMessageCollection();
        // dispose the input file as soon as Tidy has consumed it
        using (FileStream input = new FileStream(docFilePath, FileMode.Open))
        {
            tidy.Parse(input, output, msgs);
        }
        output.Seek(0, SeekOrigin.Begin);

        int node = 0;
        using (XmlReader rd = XmlReader.Create(output))
        {
            // time only the read loop over the cleaned-up output
            System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();
            while (rd.Read())
            {
                ++node;
            }
            watch.Stop();

            Console.WriteLine("Duration was : " + watch.Elapsed.ToString());
        }
    }

As others mentioned, the bottleneck is most likely not the XmlReader.

Check whether you happen to be doing a lot of string concatenation without a StringBuilder.

That can really nuke your performance.

Ron Savage

Personally, I'm pretty lazy ... so I look for .NET libraries that already solve the problem. Try using the DataSet.ReadXml() method and catching the exceptions. It does a pretty amazing job of explaining the XML format errors.

I'm using this function for verifying strings/fragments:

<Runtime.CompilerServices.Extension()>
Public Function IsValidXMLFragment(ByVal xmlFragment As String, Optional Strict As Boolean = False) As Boolean
    IsValidXMLFragment = True

    Dim NameTable As New Xml.NameTable

    Dim XmlNamespaceManager As New Xml.XmlNamespaceManager(NameTable)
    XmlNamespaceManager.AddNamespace("xsd", "http://www.w3.org/2001/XMLSchema")
    XmlNamespaceManager.AddNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")

    Dim XmlParserContext As New Xml.XmlParserContext(Nothing, XmlNamespaceManager, Nothing, Xml.XmlSpace.None)

    Dim XmlReaderSettings As New Xml.XmlReaderSettings
    XmlReaderSettings.ConformanceLevel = Xml.ConformanceLevel.Fragment
    XmlReaderSettings.ValidationType = Xml.ValidationType.Schema
    If Strict Then
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ProcessInlineSchema)
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ReportValidationWarnings)
    Else
        XmlReaderSettings.ValidationFlags = XmlSchemaValidationFlags.None
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.AllowXmlAttributes)
    End If

    AddHandler XmlReaderSettings.ValidationEventHandler, Sub() IsValidXMLFragment = False
    AddHandler XmlReaderSettings.ValidationEventHandler, AddressOf XMLValidationCallBack

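    Try
        Using XmlReader As Xml.XmlReader = Xml.XmlReader.Create(New IO.StringReader(xmlFragment), XmlReaderSettings, XmlParserContext)
            While XmlReader.Read
                'Read entire XML
            End While
        End Using
    Catch ex As Xml.XmlException
        'Malformed XML throws instead of raising a validation event
        IsValidXMLFragment = False
    End Try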
End Function

I'm using this function for verifying files:

Public Function IsValidXMLDocument(ByVal Path As String, Optional Strict As Boolean = False) As Boolean
    IsValidXMLDocument = IO.File.Exists(Path)
    If Not IsValidXMLDocument Then Exit Function

    Dim XmlReaderSettings As New Xml.XmlReaderSettings
    XmlReaderSettings.ConformanceLevel = Xml.ConformanceLevel.Document
    XmlReaderSettings.ValidationType = Xml.ValidationType.Schema
    If Strict Then
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ProcessInlineSchema)
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.ReportValidationWarnings)
    Else
        XmlReaderSettings.ValidationFlags = XmlSchemaValidationFlags.None
        XmlReaderSettings.ValidationFlags = (XmlReaderSettings.ValidationFlags Or XmlSchemaValidationFlags.AllowXmlAttributes)
    End If
    XmlReaderSettings.CloseInput = True

    AddHandler XmlReaderSettings.ValidationEventHandler, Sub() IsValidXMLDocument = False
    AddHandler XmlReaderSettings.ValidationEventHandler, AddressOf XMLValidationCallBack

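    Try
        Using FileStream As New IO.FileStream(Path, IO.FileMode.Open)
            Using XmlReader As Xml.XmlReader = Xml.XmlReader.Create(FileStream, XmlReaderSettings)
                While XmlReader.Read
                    'Read entire XML
                End While
            End Using
        End Using
    Catch ex As Xml.XmlException
        'Malformed XML throws instead of raising a validation event
        IsValidXMLDocument = False
    End Try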
End Function