Fix malformed XML in PHP before processing using DOMDocument functions

抹茶落季 2020-11-27 20:50

I need to load an XML document into PHP that comes from an external source. The XML does not declare its encoding and contains illegal characters like &

3 Answers
  • 2020-11-27 20:58

    If the Tidy extension is not an option, you may consider HTML Purifier.

  • 2020-11-27 21:12

    Try using the Tidy library, which can clean up bad HTML and XML: http://php.net/manual/en/book.tidy.php
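
    As a rough sketch of the Tidy route, assuming the tidy extension is installed: the helper name repairXmlWithTidy is hypothetical, and input-xml, output-xml and wrap are Tidy configuration options passed through tidy_repair_string(). How much Tidy manages to repair depends on the input, so treat the result as something to inspect rather than trust.

```php
<?php
// Hypothetical helper: run a malformed XML string through Tidy in XML mode.
// Returns null when the tidy extension is not loaded or the repair fails.
function repairXmlWithTidy(string $xml): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }

    $config = [
        'input-xml'  => true, // parse the input as XML rather than HTML
        'output-xml' => true, // write well-formed XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];

    $repaired = tidy_repair_string($xml, $config, 'utf8');

    return is_string($repaired) ? $repaired : null;
}

// Made-up sample with a bare ampersand.
$result = repairXmlWithTidy('<feed><NAME>This & This</NAME></feed>');
if ($result !== null) {
    echo $result, PHP_EOL;
}
```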

    A pure PHP solution to fix some XML like this:

    <?xml version="1.0"?>
    <feed>
    <RECORD>
    <ID>117387</ID>
    <ADVERTISERNAME>Test < texter</ADVERTISERNAME>
    <AID>10544740</AID>
    <NAME>This & This</NAME>
    <DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
    </RECORD>
    </feed>
    

    Would be something like this:

      function cleanupXML($xml) {
        $xmlOut = '';
        $inTag = false;
        $xmlLen = strlen($xml);
        for ($i = 0; $i < $xmlLen; ++$i) {
            $char = $xml[$i];
            switch ($char) {
            case '<':
              // Treat < as literal text unless a > appears before the next <
              $isTagStart = false;
              if (!$inTag) {
                  // Seek forward for the next tag boundary
                  for ($j = $i + 1; $j < $xmlLen; ++$j) {
                     $nextChar = $xml[$j];
                     if ($nextChar === '<') {  // Means a < in text
                        break;
                     }
                     if ($nextChar === '>') {  // Means we are in a tag
                        $isTagStart = true;
                        break;
                     }
                  }
              }
              if ($isTagStart) {
                 $inTag = true;
              } else {
                 // Also covers a < with no boundary before the end of input
                 $char = htmlentities($char);
              }
              break;
            case '>':
              if (!$inTag) {  // No need to seek ahead here
                 $char = htmlentities($char);
              } else {
                 $inTag = false;
              }
              break;
            case '&':
              // Only & needs encoding in text; passing every byte through
              // htmlentities would corrupt multi-byte UTF-8 characters
              if (!$inTag) {
                 $char = htmlentities($char);
              }
              break;
            }
            $xmlOut .= $char;
        }
        return $xmlOut;
      }
    

    This is a simple state machine that tracks whether we are inside a tag and, if not, encodes the text using htmlentities.

    Note that this holds the whole document in memory, so for large files you may want to rewrite it as a stream plugin or a pre-processor.
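
    If the only problem is bare ampersands, as in the question, a narrower alternative is a single regex pass. This is a different technique from the state machine above, offered as a sketch only: the helper name escapeBareAmpersands and the sample string are made up for illustration, and the pattern leaves anything that already looks like an entity or character reference alone.

```php
<?php
// Hypothetical helper: escape every & that does not already start an entity
// or character reference. HTML-only names such as &nbsp; are left untouched
// and would still be rejected by an XML parser.
function escapeBareAmpersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Made-up sample: one bare & and one that is already escaped.
$raw = '<feed><NAME>This & This</NAME><AID>1 &amp; 2</AID></feed>';

$dom = new DOMDocument();
$ok = $dom->loadXML(escapeBareAmpersands($raw));

if ($ok) {
    echo $dom->saveXML($dom->documentElement), PHP_EOL;
}
```

    The pattern does not know about CDATA sections or comments, so an & inside those would be escaped as well.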

  • 2020-11-27 21:16

    To solve this issue, set the DOMDocument recover property to TRUE before loading the XML document:

    $dom->recover = TRUE;

    Try this code:

    $feedURL = '3704017_14022010_050004.xml';
    $dom = new DOMDocument();
    $dom->recover = TRUE;
    $dom->load($feedURL);
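
    In recover mode the parser still reports every problem it skips over, so it can help to collect those reports and see what was dropped. The following is a minimal sketch with a made-up in-memory sample loaded through loadXML() instead of a file; libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are standard PHP functions.

```php
<?php
// Collect libxml errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Made-up sample with a bare ampersand.
$xml = '<feed><NAME>This & This</NAME></feed>';

$dom = new DOMDocument();
$dom->recover = true;           // keep parsing after well-formedness errors
$loaded = $dom->loadXML($xml);

// Each entry is a LibXMLError carrying the line number and message.
foreach (libxml_get_errors() as $error) {
    echo sprintf('line %d: %s', $error->line, trim($error->message)), PHP_EOL;
}
libxml_clear_errors();

if ($loaded && $dom->documentElement !== null) {
    // Inspect what survived recovery before trusting it.
    echo $dom->saveXML($dom->documentElement), PHP_EOL;
}
```

    Recovery may silently discard the offending characters, so compare the recovered output against the source when the dropped text matters.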
    