PHP parsing xml file error

Question

I am trying to use SimpleXML to get data from http://rates.fxcm.com/RatesXML. With simplexml_load_file() I sometimes get errors, because the response from this site contains stray strings/numbers before and after the XML document. Example:

2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURUSD">
    <Bid>1.27595</Bid>
    <Ask>1.2762</Ask>
    <High>1.27748</High>
    <Low>1.27385</Low>
    <Direction>-1</Direction>
    <Last>23:29:11</Last>
</Rate>
</Rates>
0

I then decided to fetch the response with file_get_contents(), use substr() to remove the strings before and after the XML, and parse the result with simplexml_load_string(). However, sometimes the random strings also appear between the nodes, like this:

<Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Ask>2.29562</Ask>
    <High>2.29841</High>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
    <Last>23:29:11</Last>
</Rate>

My question is: is there any way I can deal with all these random strings in one go with a regex function, regardless of where they are placed? (I think that would be a better idea than contacting the site to get them to serve proper XML files.)
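
For reference, the substr() trimming step described above could look roughly like the sketch below. The helper name trimToOuterXml and the sample string are made up for illustration; only strpos(), strrpos(), substr() and simplexml_load_string() are real PHP functions. It keeps everything from the first "<" to the last ">", which is enough for the first example but leaves any stray text between nodes in place.

```php
<?php
// Hypothetical helper: keep only the span from the first "<" to the last ">".
function trimToOuterXml(string $raw): string
{
    $start = strpos($raw, '<');
    $end   = strrpos($raw, '>');
    if ($start === false || $end === false || $end < $start) {
        return ''; // nothing that looks like markup
    }
    return substr($raw, $start, $end - $start + 1);
}

// Sample modelled on the first example in the question (shortened).
$raw = "2000<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<Rates>\n"
     . "    <Rate Symbol=\"EURUSD\">\n"
     . "    <Bid>1.27595</Bid>\n"
     . "    <Ask>1.2762</Ask>\n"
     . "</Rate>\n"
     . "</Rates>\n"
     . "0\n";

$xml = simplexml_load_string(trimToOuterXml($raw));
if ($xml !== false) {
    echo $xml->Rate['Symbol'], ' bid: ', $xml->Rate->Bid, "\n";
}
```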

Answer 1:

I believe preprocessing XML with regular expressions might be just as bad as parsing it with them.

But here is a preg_replace() call that removes stray text from the beginning of the string, from the end of the string, and (as long as it is non-whitespace) after closing or self-closing tags:

$string = preg_replace( '~
    (?|           # start of alternation where capturing group count starts from
                  # 1 for each alternative
      ^[^<]*      # match non-< characters at the beginning of the string
    |             # OR
      [^>]*$      # match non-> characters at the end of the string
    |             # OR
      (           # start of capturing group $1: closing tag
        </[^>]++> # match a closing tag; note the possessive quantifier (++); it
                  # suppresses backtracking, which is a convenient optimization,
                  # the following bit is mutually exclusive anyway (this will be
                  # used throughout the regex)
        \s++      # and the following whitespace
      )           # end of $1
      [^<\s]*+    # match non-<, non-whitespace characters (the "bad" ones)
      (?:         # start subgroup to repeat for more whitespace/non-whitespace
                  # sequences
        \s++      # match whitespace
        [^<\s]++  # match at least one "bad" character
      )*          # repeat
                  # note that this kind of pattern keeps all whitespace
                  # before the first and after the last "bad" character
    |             # OR
      (           # start of capturing group $1: self-closing tag
        <[^>/]+/> # match a self-closing tag
        \s++      # and the following whitespace
      )
      [^<\s]*+(?:\s++[^<\s]++)*
                  # same as before
    )             # end of alternation
    ~x',
    '$1',
    $input);

The replacement '$1' then writes back the closing or self-closing tag (together with the whitespace after it) if there was one.
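
As a usage sketch, the pattern can be wrapped in a function and its result handed to simplexml_load_string(). The function name stripStrayText and the sample input (modelled on the second example in the question) are illustrative assumptions. The pattern is essentially the one above with shorter comments, using the same "bad character" sub-pattern after both kinds of tag.

```php
<?php
// Hypothetical wrapper around the pattern; the name is illustrative only.
function stripStrayText(string $input): string
{
    $pattern = '~
        (?|
            ^[^<]*                                      # text before the first "<"
          |
            [^>]*$                                      # text after the last ">"
          |
            (</[^>]++>\s++) [^<\s]*+ (?:\s++[^<\s]++)*  # junk after a closing tag
          |
            (<[^>/]+/>\s++) [^<\s]*+ (?:\s++[^<\s]++)*  # junk after a self-closing tag
        )
        ~x';
    $clean = preg_replace($pattern, '$1', $input);
    if ($clean === null) {
        throw new RuntimeException('preg_replace() failed with code ' . preg_last_error());
    }
    return $clean;
}

// Sample with stray text at the start, at the end, and between two nodes.
$raw = "2000<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<Rates>\n"
     . "<Rate Symbol=\"EURTRY\">\n"
     . "    <Bid>2.29443</Bid>\n"
     . "    <Low>2.28999</Low>\n"
     . "\n137b\n\n"
     . " <Direction>1</Direction>\n"
     . "</Rate>\n"
     . "</Rates>\n"
     . "0\n";

$xml = simplexml_load_string(stripStrayText($raw));
if ($xml !== false) {
    echo $xml->Rate['Symbol'], ' direction: ', $xml->Rate->Direction, "\n";
}
```

Note that the sketch only covers stray text that follows a closing or self-closing tag (or sits at either end of the string); junk directly after an opening tag would still reach the parser.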

One of the reasons this approach is not safe is that closing or self-closing tags might occur inside comments or attribute strings. But I can hardly suggest you use an XML parser instead, since your XML parser can't parse the XML either.
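
To make the first caveat concrete, here is a small sketch in which a closing tag appears inside a comment. The function name cleanXmlWithRegex and the input are hypothetical; the pattern is a compact, comment-free rendering of the one above. The input is well-formed XML, but the cleanup treats the rest of the comment as junk and removes its terminator, so the cleaned string no longer parses.

```php
<?php
// Hypothetical helper: same idea as the pattern above, written on one line.
function cleanXmlWithRegex(string $input): string
{
    $pattern = '~(?|^[^<]*|[^>]*$'
             . '|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*'
             . '|(<[^>/]+/>\s++)[^<\s]*+(?:\s++[^<\s]++)*)~';
    return (string) preg_replace($pattern, '$1', $input);
}

// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Well-formed input: the comment merely mentions a closing tag.
$valid   = '<Rates><!-- see </Bid> tag --><Bid>1</Bid></Rates>';
$cleaned = cleanXmlWithRegex($valid);

var_dump(simplexml_load_string($valid) !== false);   // does the original parse?
var_dump(simplexml_load_string($cleaned) !== false); // does the cleaned text parse?
echo $cleaned, "\n";
```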

Source: https://stackoverflow.com/questions/13447793/php-parsing-xml-file-error

Tags

php

regex

parsing

preg-replace

simplexml