How to get MATLAB to read the correct number of XML nodes

Posted by 给你一囗甜甜゛ on 2019-12-19 06:03:42

Question


I'm reading a simple XML file using MATLAB's built-in xmlread function.

<root>
    <ref>
        <requestor>John Doe</requestor>
        <project>X</project>
    </ref>
</root>

But when I call getChildNodes() on the ref element, it tells me that it has 5 children.

It works fine if I put all the XML on one line: MATLAB then tells me that the ref element has 2 children.

It doesn't seem to like the whitespace between elements.

Even if I run Canonicalize in the oXygen XML editor, I still get the same result, because Canonicalize still leaves the whitespace in place.

MATLAB uses Java and Xerces for XML processing.

Question:

What can I do so that I can keep my XML file in a human-readable format (not all on one line) but still have MATLAB parse it correctly?

Code Update:

filename='example01.xml';
docNode = xmlread(filename);
rootNode = docNode.getDocumentElement;
entries = rootNode.getChildNodes;
nEnt = entries.getLength

Answer 1:


The XML parser creates #text nodes behind the scenes for all whitespace between elements. Wherever there is a newline or indentation, it creates a #text node whose data holds the newline and the indentation spaces that follow it. So in the XML example you provided, parsing the child nodes of the "ref" element returns 5 nodes:

  1. Node 1: #text with newline and indentation spaces
  2. Node 2: "requestor" node which in turn has a #text child with "John Doe" in the data portion
  3. Node 3: #text with newline and indentation spaces
  4. Node 4: "project" node which in turn has a #text child with "X" in the data portion
  5. Node 5: #text with newline and indentation spaces
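
The five nodes listed above can be confirmed with a short diagnostic. This is a minimal sketch, not part of the original answer: it assumes the example document from the question, writes it to a temporary file (the names `tmpFile`, `refNode` and `names` are illustrative), and prints the name and type of each child of "ref".

```matlab
% Write the example document to a temporary file so the sketch is self-contained.
xmlText = sprintf(['<root>\n' ...
                   '    <ref>\n' ...
                   '        <requestor>John Doe</requestor>\n' ...
                   '        <project>X</project>\n' ...
                   '    </ref>\n' ...
                   '</root>\n']);
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s', xmlText);
fclose(fid);

% Parse the file and fetch the child list of the first <ref> element.
docNode  = xmlread(tmpFile);
refNode  = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;

% Record and print the node name and numeric node type of every child.
nChildren = children.getLength;
names = cell(1, nChildren);
for k = 1:nChildren
    child = children.item(k-1);          % Java DOM lists are zero-based
    names{k} = char(child.getNodeName);
    fprintf('%d: %s (type %d)\n', k, names{k}, child.getNodeType);
end

delete(tmpFile);                         % clean up the temporary file
```

This only lists the nodes. Removing the whitespace-only ones is handled separately by the function that follows.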

The function below removes all of these whitespace-only #text nodes for you. Note that if you intentionally have an XML element whose content is nothing but whitespace, this function will remove that content as well, but for the vast majority of XML files it should work just fine.

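function removeIndentNodes( childNodes )
% Recursively remove whitespace-only #text nodes (newlines and indentation).

numNodes = childNodes.getLength;
remList = [];
% Walk backwards so the collected indices are in descending order; removing
% a node then never shifts the indices that are still waiting to be removed.
for i = numNodes:-1:1
   theChild = childNodes.item(i-1);   % Java lists are zero-based
   if (theChild.hasChildNodes)
      % Nodes with children are descended into, never removed themselves.
      removeIndentNodes(theChild.getChildNodes);
   else
      % A leaf is removed only if it is a non-empty, all-whitespace text node.
      if ( theChild.getNodeType == theChild.TEXT_NODE && ...
           ~isempty(char(theChild.getData()))         && ...
           all(isspace(char(theChild.getData()))))
         remList(end+1) = i-1; % java indexing
      end
   end
end
for i = 1:length(remList)
   childNodes.removeChild(childNodes.item(remList(i)));
end

end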

Call it like this:

tree = xmlread( xmlfile );
removeIndentNodes( tree.getChildNodes );
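
If the tree does not need to be modified, a possible alternative is to leave the #text nodes in place and skip everything that is not an element while iterating. The sketch below is not from the original answer; it assumes the example document from the question, and the names `tmpFile` and `elementNames` are illustrative.

```matlab
% Write the example document to a temporary file so the sketch is self-contained.
xmlText = sprintf(['<root>\n' ...
                   '    <ref>\n' ...
                   '        <requestor>John Doe</requestor>\n' ...
                   '        <project>X</project>\n' ...
                   '    </ref>\n' ...
                   '</root>\n']);
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s', xmlText);
fclose(fid);

docNode  = xmlread(tmpFile);
refNode  = docNode.getElementsByTagName('ref').item(0);
children = refNode.getChildNodes;

% Keep only element children; whitespace #text nodes are simply skipped.
elementNames = {};
for k = 0:children.getLength-1           % Java DOM lists are zero-based
    child = children.item(k);
    if child.getNodeType == child.ELEMENT_NODE
        elementNames{end+1} = char(child.getNodeName); %#ok<SAGROW>
    end
end

fprintf('ref has %d element children\n', numel(elementNames));
delete(tmpFile);                         % clean up the temporary file
```

The trade-off is that every loop over child nodes needs the same type check, whereas removeIndentNodes cleans the tree once up front.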



Answer 2:


I felt that @cholland's answer was good, but I didn't like the extra XML work. So here is a solution that strips the whitespace, which is the root cause of the unwanted nodes, from a copy of the XML file.

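% Collapse line breaks, then drop the whitespace between adjacent tags.
str = regexprep(fileread(filename),'[\n\r]+',' ');
str = regexprep(str,'>[\s]*<','><');
% Write the stripped copy; the original file stays human-readable.
fid = fopen('tmpCopy.xml','wt');
fprintf(fid,'%s', str);
fclose(fid);
% Parse the stripped copy instead of the original file.
docNode = xmlread('tmpCopy.xml');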
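
The same idea can probably be applied without a temporary copy by handing the stripped string to the parser directly. This is a sketch under assumptions rather than part of the original answer: it relies on xmlread accepting an `org.xml.sax.InputSource`, uses the example document from the question as an inline string, and the name `nRefChildren` is illustrative.

```matlab
% The example document as a string; in practice this would come from
% fileread(filename).
xmlText = sprintf(['<root>\n' ...
                   '    <ref>\n' ...
                   '        <requestor>John Doe</requestor>\n' ...
                   '        <project>X</project>\n' ...
                   '    </ref>\n' ...
                   '</root>\n']);

% Drop whitespace-only runs between a closing '>' and the next '<'.
str = regexprep(xmlText, '>\s+<', '><');

% Wrap the string in a Java reader so no temporary file is needed
% (assumes xmlread accepts an InputSource object).
src     = org.xml.sax.InputSource(java.io.StringReader(str));
docNode = xmlread(src);

refNode      = docNode.getElementsByTagName('ref').item(0);
nRefChildren = refNode.getChildNodes.getLength;
fprintf('ref has %d children\n', nRefChildren);
```

Note that a plain regular expression cannot tell markup from content, so whitespace between '>' and '<' inside comments or CDATA sections would be stripped too.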


Source: https://stackoverflow.com/questions/11548590/how-to-get-matlab-to-read-correct-amount-of-xml-nodes
