xpath querying when xml format varies

陌路散爱 submitted on 2019-12-13 16:29:31

Question


I have a series of variable types like:

abc1A, abc1B, abc3B, ...
xyz1A, xyz2A, xyz3C, ...
data1C, data2A, ...

Stored in a variety of xml formats:

<area name="DataMap">
    <int name="number" nullable="true">
        <case var="abc2,abc3,abc5">11</case>
        <case var="abc4,abc6*">8</case>
        <case var="data1,xyz7,xyz8">22</case>
        <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
        <case var="xyz{6,4A,4B,4C}">20</case>
        <case var="other01">15</case>
    </int>
</area>

I'm hoping to query what an instance like xyz5A, for example, maps to. The query should return 24, but I don't know ahead of time if its reference in the xml node is explicit as in "xyz4A", or via a wildcard like "xyz4*", or in curly braces like above.

This queries for strings on that line and will return a hit successfully:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"xyz")][contains(@var,"5A")]'

But it also returns a hit for data5A, which is incorrect:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"data")][contains(@var,"5A")]'

Are there xpath/other query constructs that parse the inconsistent (but I assume valid) xml above? I only seem to be able to query against explicit string matches vs. the wildcard and curly braced formats.


Answer 1:


Being in bash/perl you are likely bound to libxml. libxml doesn't support XPath 2.0. There are many questions on SO about XPath/XSLT 2.0 with libxml/libxslt and Perl.

XPath 1.0 has a variety (a small one, I have to admit) of string functions, and you could try to stack them together. I experimented for a bit; I neither liked the result nor managed to cover all possible cases. You would have "ugly" constructs like:

...
or
(contains(@var, ',xyz{') and 
 contains(substring-before(substring-after(@var, ',xyz{'), '}'), '5A') and
     (contains(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A,') or
      starts-with(substring-after(@var, ',xyz{'), '5A,') or
      starts-with(substring-after(@var, ',xyz{'), '5A}') or
      substring-after(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A') = ''))

or
...

And then you would realize that substring-* functions work off of the first occurrence of the matching string and you need even more layers of ands and ors to handle cases like yours:

<case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>

where there are multiple xyz{ and the one you need is not known to be the first one.
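
To make the first-occurrence problem concrete, here is a small sketch in Python (used purely for illustration; it is not an XPath engine). The helpers substring_after and substring_before are hypothetical stand-ins that mimic the XPath 1.0 functions of the same name for non-empty separators only.

```python
def substring_after(s, sep):
    """Mimic XPath 1.0 substring-after: text after the FIRST occurrence of sep, else ''."""
    _head, found, tail = s.partition(sep)
    return tail if found else ""


def substring_before(s, sep):
    """Mimic XPath 1.0 substring-before: text before the FIRST occurrence of sep, else ''."""
    head, found, _tail = s.partition(sep)
    return head if found else ""


VAR = "data3A,xyz{9},xyz{5A,5B,5C}"

# Only the first ",xyz{" group is ever inspected.
first_group = substring_before(substring_after(VAR, ",xyz{"), "}")

print(first_group)          # contents of the first brace group
print("5A" in first_group)  # the second group, which holds 5A, is never reached
```

The extracted group is the contents of xyz{9}, so a test for 5A fails even though the attribute does list xyz5A further along.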

I think this is a case where you forget that you have XML and just do what Perl is good at: treat it as text. As much as I like XML-aware tools for XML processing and data extraction, you will likely be better off with regular expressions and string manipulation in a language that was designed for them.
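
A minimal sketch of that text-based approach follows, written in Python rather than Perl purely for illustration (the same logic carries over). It translates each var attribute into a regular expression and matches the queried name against it. The names var_to_regex, lookup and CASE_RE are hypothetical, and two assumptions are baked in: `*` means "any trailing characters", and a name without `*` must match exactly. Neither is confirmed by the question. Nested or unbalanced braces are not handled.

```python
import re

XML_TEXT = '''<area name="DataMap">
    <int name="number" nullable="true">
        <case var="abc2,abc3,abc5">11</case>
        <case var="abc4,abc6*">8</case>
        <case var="data1,xyz7,xyz8">22</case>
        <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
        <case var="xyz{6,4A,4B,4C}">20</case>
        <case var="other01">15</case>
    </int>
</area>'''

# Treat the file as text: grab each var attribute and the element's value.
CASE_RE = re.compile(r'<case\s+var="([^"]*)">([^<]*)</case>')


def var_to_regex(var):
    """Turn 'xyz{5A,5B},abc6*' into a regex like (?:xyz(?:5A|5B)|abc6.*)."""
    parts = []
    for ch in var:
        if ch == ",":
            parts.append("|")      # a comma separates alternatives
        elif ch == "{":
            parts.append("(?:")    # a brace group becomes a nested alternation
        elif ch == "}":
            parts.append(")")
        elif ch == "*":
            parts.append(".*")     # assumption: * matches any trailing text
        else:
            parts.append(re.escape(ch))
    return re.compile("(?:" + "".join(parts) + ")")


def lookup(name, text=XML_TEXT):
    """Return the value of the first <case> whose var pattern matches name, else None."""
    for var, value in CASE_RE.findall(text):
        if var_to_regex(var).fullmatch(name):
            return value
    return None


print(lookup("xyz5A"))   # matched through the second xyz{...} group
print(lookup("data5A"))  # no false hit from "data3A" plus "5A"
```

Because the whole name must match one alternative, data5A no longer hits the row that merely contains both "data" and "5A".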




Answer 2:


I guess the smartest thing would be to iterate over all variables and programmatically find the matches, not asking XPath to do it.

Barring that, I have at least a few thoughts on the braces; unfortunately, they probably don't help all that much for the * question.

It seems that there are perl XPath implementations where you could write .../case[@var =~ /some_regex/], maybe .../case["xyz4A" =~ to_regex(@var)], and maybe even .../case[explode_braces(@var) =~ /(^|,)xyz4A(,|$)/] (with a suitably written explode_braces function, of course). See http://www.perlmonks.org/?node_id=831612, for example. I would expect the explode_braces way to be much, much easier to get working than the first alternative, and I do use regular expressions quite a lot. Then again, you seem to use bash-regexes, and transforming those to a perl regex should also be relatively straightforward, so if the second idea works, you may be good to go.

If that does not work, maybe hook into your XML parser or right before it and fix this horrible XML design by expanding the braces?

$input =~ s/\bvar="([^"]*)"/'var="' . explode_braces($1) . '"'/eg;

(Or something very similar, sorry, I haven't written much perl in the last years. Also, this assumes your xml only uses one type of attribute quotes, but that should be easy to fix, and that the only place where var=" is found is in these attributes, which may be a much harder limitation.)
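
The brace-expansion idea can be sketched as follows, in Python rather than Perl purely for illustration. explode_braces is the hypothetical helper named above; its body here is one assumed implementation that handles only flat groups of the form prefix{a,b}. expand_var_attributes is likewise a made-up name for the substitution step, and it carries the same limitations already noted: double-quoted attributes only, and var=" must not appear anywhere else.

```python
import re


def explode_braces(var):
    """Expand flat brace groups: 'xyz{5A,5B}' -> 'xyz5A,xyz5B' (nested braces unsupported)."""
    def expand(match):
        prefix, body = match.group(1), match.group(2)
        return ",".join(prefix + item for item in body.split(","))

    # The prefix is whatever precedes "{" back to the previous comma or brace.
    return re.sub(r"([^,{}]*)\{([^{}]*)\}", expand, var)


def expand_var_attributes(xml_text):
    """Rewrite every var="..." attribute in the raw text with its braces expanded."""
    return re.sub(
        r'\bvar="([^"]*)"',
        lambda m: 'var="' + explode_braces(m.group(1)) + '"',
        xml_text,
    )


line = '<case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>'
print(expand_var_attributes(line))
```

Once the attributes hold plain comma-separated names, an XPath 1.0 token test such as contains(concat(",", @var, ","), ",xyz5A,") is enough for the explicit names; the * entries would still need separate handling.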



Source: https://stackoverflow.com/questions/10647147/xpath-querying-when-xml-format-varies
