Why does parsing XML document using MSXML v3.0 work, but MSXML v6.0 doesn't

徘徊边缘 提交于 2021-01-27 12:51:09

问题


So, I am working on a project that scrapes and collects data from many different sources around the internet with many different methods depending on each source's characteristics.

The most recent addition is a web API call which returns the following XML as a response:

<?xml version="1.0"?>
<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0">
    <mRID>29b526a69b9445a7bb507ba446e3e8f9</mRID>
    <revisionNumber>1</revisionNumber>
    <type>A44</type>
    <sender_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</sender_MarketParticipant.mRID>
    <sender_MarketParticipant.marketRole.type>A32</sender_MarketParticipant.marketRole.type>
    <receiver_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</receiver_MarketParticipant.mRID>
    <receiver_MarketParticipant.marketRole.type>A33</receiver_MarketParticipant.marketRole.type>
    <createdDateTime>2019-09-19T11:28:51Z</createdDateTime>
    <period.timeInterval>
        <start>2019-09-18T22:00Z</start>
        <end>2019-09-19T22:00Z</end>
    </period.timeInterval>
    <TimeSeries>
        <mRID>1</mRID>
        <businessType>A62</businessType>
        <in_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</in_Domain.mRID>
        <out_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</out_Domain.mRID>
        <currency_Unit.name>EUR</currency_Unit.name>
        <price_Measure_Unit.name>MWH</price_Measure_Unit.name>
        <curveType>A01</curveType>
        <Period>
            <timeInterval>
                <start>2019-09-18T22:00Z</start>
                <end>2019-09-19T22:00Z</end>
            </timeInterval>
            <resolution>PT60M</resolution>
            <Point>
                <position>1</position>
                <price.amount>44.08</price.amount>
            </Point>
            <Point>
                <position>2</position>
                <price.amount>37.14</price.amount>
            </Point>
            <Point>
                <position>3</position>
                <price.amount>32.21</price.amount>
            </Point>
            <Point>
                <position>4</position>
                <price.amount>31.44</price.amount>
            </Point>
            <Point>
                <position>5</position>
                <price.amount>32.48</price.amount>
            </Point>
            <Point>
                <position>6</position>
                <price.amount>45.52</price.amount>
            </Point>
            <Point>
                <position>7</position>
                <price.amount>56.05</price.amount>
            </Point>
            <Point>
                <position>8</position>
                <price.amount>74.96</price.amount>
            </Point>
            <Point>
                <position>9</position>
                <price.amount>74.08</price.amount>
            </Point>
            <Point>
                <position>10</position>
                <price.amount>69.03</price.amount>
            </Point>
            <Point>
                <position>11</position>
                <price.amount>72.89</price.amount>
            </Point>
            <Point>
                <position>12</position>
                <price.amount>68.91</price.amount>
            </Point>
            <Point>
                <position>13</position>
                <price.amount>74.95</price.amount>
            </Point>
            <Point>
                <position>14</position>
                <price.amount>72.91</price.amount>
            </Point>
            <Point>
                <position>15</position>
                <price.amount>75.97</price.amount>
            </Point>
            <Point>
                <position>16</position>
                <price.amount>76.49</price.amount>
            </Point>
            <Point>
                <position>17</position>
                <price.amount>59.08</price.amount>
            </Point>
            <Point>
                <position>18</position>
                <price.amount>60.19</price.amount>
            </Point>
            <Point>
                <position>19</position>
                <price.amount>64.69</price.amount>
            </Point>
            <Point>
                <position>20</position>
                <price.amount>69.18</price.amount>
            </Point>
            <Point>
                <position>21</position>
                <price.amount>64.97</price.amount>
            </Point>
            <Point>
                <position>22</position>
                <price.amount>63.38</price.amount>
            </Point>
            <Point>
                <position>23</position>
                <price.amount>52.92</price.amount>
            </Point>
            <Point>
                <position>24</position>
                <price.amount>48.08</price.amount>
            </Point>
        </Period>
    </TimeSeries>
</Publication_MarketDocument> 

Having dealt successfully with situations like that using Microsoft XML, v6.0 I tried the following:

Dim respXML As New MSXML2.DOMDocument60
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1")) 'for the sake of the post's simplicity I'm loading the xml from excel
Debug.Print respXML.getElementsByTagName("price.amount").Length

This should be returning 24 but instead it returns 0. Indeed the following:

Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

returns True, which means that the <price.amount></price.amount> elements are not being found. However, Debug.Print respXML.XML yields the expected results.

I read somewhere that early binding could be causing problems so I tried the following as well:

Dim respXML As Object
Set respXML = CreateObject("MSXML2.DOMDocument.6.0")
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1"))
Debug.Print respXML.getElementsByTagName("price.amount").Length
Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

Still the results are the same.

Switching to Microsoft XML, v3.0 resolves the issue completely.

However, I would prefer sticking to v6.0 since it's the one that is more actively being maintained and supported.

Why does this happen? Does it have to do with the XML itself? Does it have to do with my code? Am I missing something? Is there a way to make it work with Microsoft XML, v6.0?

Any input would be appreciated.


回答1:


To extend @CindyMeister's answer, the issue does appear to be namespace handling between the MSXML versions using getElementsByTagName(). Specifically, your XML maintains an xmlns attribute without colon identified prefix which requires DOM libraries to assign a prefix when parsing content:

<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0" ...

However, using SelectionNamespaces + SelectNodes to define a temporary alias, such as doc, to default namespace prefix, both libraries print out expected results. And MS docs even advises the latter method (emphasis added):

The getElementsByTagName method simulates the matching of the provided argument against the result of the tagName property of IXMLDOMElement. When executed, it does not recognize or support namespaces. Instead, you should use the selectNodes method, which is faster in some cases and can support more complex searches.

MXSML v3.0 (prints unexpected getElementsByTagName result)

Sub ParseXMLv3()
    Dim respXML As New MSXML2.DOMDocument30

    respXML.Load "C:\Path\To\Input.xml"
    respXML.setProperty "SelectionLanguage", "XPath"
    respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"

    Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
    Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
    Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 24

    Set respXML = Nothing
End Sub

MSXML v6.0

Sub ParseXMLv6()
    Dim respXML As New MSXML2.DOMDocument60

    respXML.Load "C:\Path\To\Input.xml"
    respXML.setProperty "SelectionLanguage", "XPath"
    respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"

    Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
    Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
    Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 0

    Set respXML = Nothing
End Sub



回答2:


A quick test, here, indicates that none of the nodes/elements are picked up using DOMDocument60.

I am successful using DOMDocument30, still using the MSXML6 parser. So that could be a workaround for you:

'Using the MSXML6 parser, it's still possible to use what worked in older versions
Dim respXML As Msxml2.DOMDocument30
Set respXML = CreateObject("MSXML2.DOMDocument.3.0")

Research on the Internet turns up two useful links, one on MSDN, the other on VB forums.

The first basically says that security properties were added in MSXML6, meaning some things that worked in MSXML2 no longer will in the newer version. These are documented on Microsoft^s site.

I don't know which one it is (if any of these, but closest appears to be the SelectionNamespace property) but another change appears to be how the parser handles "anonymous" namespaces (VB Forums link). If a namespace is declared in a top-level element, with no prefix, then it's not applied to any child elements - so they aren't "seen".

As the XML code in the question contains a namespace with no prefix, this appears to be the issue. If declaring DOMDocument30 won't work for you, and SelectionNamespace doesn't help, then I think the only recourse would be to change/transform the XML to add a prefix for the namespace and to all the elements.



来源:https://stackoverflow.com/questions/58026296/why-does-parsing-xml-document-using-msxml-v3-0-work-but-msxml-v6-0-doesnt

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!