How can I validate my 3,000,000 line long XML file?

有些话、适合烂在心里, submitted on 2020-01-06 06:36:24

Question


I have an XML file. It is nearly correct, but the parser rejects it:

Error on line 302211.
Extra Content at the end of the document.

I've spent two days trying to debug this, but the file is so big that it's nearly impossible. Is there anything I can do?

Here are the relevant lines (I include the two lines before the reported error; the error begins at the <seg> tag).

 <tu>
   <tuv xml:lang="en"> 
    <prop type="feed"></prop>
    <seg>
        <bpt i="1" x="1" type="feed">
            test
        </bpt>
        To switch on computer:
        <ept i="1">
            &gt;
        </ept>
        Press device 
        <ph x="2" type="feed">
            &lt;schar _TR=&quot;123&quot; y.io.name
        </ph> or 
        <ph x="3" type="feed">
            &lt;schar _TR=&quot;274&quot; y.io.name=&quot;
        </ph> (Spain) twice. 
    </seg>
 </tuv>
</tu>

Can anyone give me some pointers on finding the issue here? I am using the Notepad++ XML plugin.


Answer 1:


Background notes

  • The XML fragment you've posted stands on its own as a well-formed XML document – the problem must be somewhere else in your XML.
  • Your particular XML problem is well-formedness, not validity.
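Both points can be checked with a short script. The following is a minimal sketch using Python's standard-library parser, not the Notepad++ plugin from the question; the fragment is an abbreviated copy of the posted <tu> element, and the second <tu/> is an assumption added only to provoke an error of the "extra content" kind.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the <tu> fragment from the question.
fragment = """<tu>
  <tuv xml:lang="en">
    <prop type="feed"></prop>
    <seg>To switch on computer: <ph x="2" type="feed">&lt;schar</ph> twice.</seg>
  </tuv>
</tu>"""

# Parses without error: the fragment is well-formed on its own.
# No schema or DTD is involved, so this says nothing about validity.
root = ET.fromstring(fragment)

# A second top-level element makes the document not well-formed.
# The parser stops at the first character after the root element.
try:
    ET.fromstring(fragment + "\n<tu/>")
    error = None
except ET.ParseError as exc:
    error = exc  # exc.position is a (line, column) tuple

print(root.tag)
print(error)
```

The position in the second case points at the extra content itself, not at whatever caused the root element to close early, which is why the reported line can look perfectly fine.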

Tips for finding XML well-formedness problems

  1. Use an XML parser with better diagnostic messages. Xerces-based tools have very good messages (albeit with a few exceptions).
  2. Know the common problems that cause an XML document not to be well-formed:
    • Missing or mismatched element closing tag.
    • Missing or mismatched attribute quote delimiter.
    • Unescaped < or & in content rather than &lt; or &amp;.
    • Multiple root elements.
    • Incomplete markup after the root element.
    • Multiple XML declarations, or an XML declaration appears other than at the top of the document.
  3. Divide and conquer. Consider this sketch of a huge XML document:

    <root>
       <First>
           <FirstChild>
              <!-- Tons of descendant markup -->
           </FirstChild>
           <SecondChild>
              <!-- Tons of descendant markup -->
           </SecondChild>
       </First>
       <Second>
           <!-- Tons of descendant markup -->
       </Second>
    </root>
    

    Process of elimination:

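    1. Delete the First element.
    2. Re-parse the document.
    3. If the error goes away, it lies inside First: restore First, remove Second, and then narrow further by removing FirstChild.
    4. Otherwise, the error lies outside First: leave First deleted and continue with what remains.
    5. Repeat until the error can be spotted easily in the reduced XML document.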
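Before deleting anything by hand, it may help to let a streaming parser report where the root element was closed and which elements were still open at the first error. The following is a minimal sketch using Python's standard-library expat bindings; the function name first_wellformedness_error and the shape of its result are hypothetical, chosen only for illustration. It reads the file in chunks, so a 3,000,000-line document does not have to fit in an editor.

```python
import xml.parsers.expat


def first_wellformedness_error(path, chunk_size=1 << 20):
    """Return None if the file is well-formed, else a dict describing the first error."""
    parser = xml.parsers.expat.ParserCreate()
    open_elements = []     # (name, line) of every element that is still open
    root_closed_at = None  # line of the end tag that closed the root element

    def start(name, attrs):
        open_elements.append((name, parser.CurrentLineNumber))

    def end(name):
        nonlocal root_closed_at
        open_elements.pop()
        if not open_elements:
            root_closed_at = parser.CurrentLineNumber

    parser.StartElementHandler = start
    parser.EndElementHandler = end

    try:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                parser.Parse(chunk, False)
            parser.Parse(b"", True)  # signal end of input
    except xml.parsers.expat.ExpatError as exc:
        return {
            "message": xml.parsers.expat.ErrorString(exc.code),
            "line": exc.lineno,
            "column": exc.offset,
            "open_elements": list(open_elements),
            "root_closed_at": root_closed_at,
        }
    return None
```

For an error of the "extra content at the end of the document" kind, root_closed_at is the interesting value: it is the line where the parser considered the root element finished, so the stray or premature closing tag is at or before that line rather than at the line named in the error message. For a mismatched tag, open_elements lists the ancestors of the offending tag together with the lines where they were opened.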

See also

  • How to parse invalid (bad / not well-formed) XML?


Source: https://stackoverflow.com/questions/47531968/how-can-i-validate-my-3-000-000-line-long-xml-file
