Parse XML with PHP and XMLReader

野性不改 2020-12-21 02:29

I've been trying to parse a very large XML file with PHP and XMLReader, but can't seem to get the results I am looking for. Basically, I'm searching a ton of information

2 answers
  • 2020-12-21 02:56

    To get more flexibility out of XMLReader, I usually write iterators that wrap the XMLReader object and provide the traversal steps I need.

    These range from a simple iteration over all nodes to an iteration over elements, optionally restricted to a specific name. Let's call the latter XMLElementIterator; it takes the reader and the element name as parameters.
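
    As a rough illustration of that idea (not the code from the gist linked below), element iteration by name can be sketched with a generator. The function name iterateElements and the inline sample XML are assumptions made for this example only:

```php
<?php
// Minimal sketch: yield the reader each time it is positioned on an
// element with the given name. This is NOT the gist's XMLElementIterator.
function iterateElements(XMLReader $reader, string $name): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            yield $reader;
        }
    }
}

// Made-up sample input; a real run would use $reader->open($path) instead.
$xml = '<lineups><headend headendId="A"><name>One</name></headend>'
     . '<headend headendId="B"><name>Two</name></headend></lineups>';

$reader = new XMLReader();
$reader->XML($xml);

$ids = [];
foreach (iterateElements($reader, 'headend') as $node) {
    $ids[] = $node->getAttribute('headendId');
}
$reader->close();

echo implode(',', $ids), "\n"; // prints "A,B"
```

    Because the generator yields the reader itself, only the current node is held at any time, which is what keeps this approach usable on very large files.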

    In your scenario I then would create an iterator that returns a SimpleXMLElement for the current element, taking only the <headend> elements:

    require('xmlreader-iterators.php'); // https://gist.github.com/hakre/5147685
    
    class HeadendIterator extends XMLElementIterator {
        const ELEMENT_NAME = 'headend';
    
        public function __construct(XMLReader $reader) {
            parent::__construct($reader, self::ELEMENT_NAME);
        }
    
        /**
         * @return SimpleXMLElement
         */
        public function current() {
            return simplexml_load_string($this->reader->readOuterXml());
        }
    }
    

    Equipped with this iterator, the rest of the job is straightforward. First open the 10 gigabyte file (XMLReader streams it, so nothing is loaded into memory up front):

    $pc      = "78746";
    
    $xmlfile = '../data/lineups.xml';
    $reader  = new XMLReader();
    $reader->open($xmlfile);
    

    Then check whether each <headend> element contains the postal code and, if so, display its data as XML:

    foreach (new HeadendIterator($reader) as $headend) {
        /* @var $headend SimpleXMLElement */
        if (!$headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
            continue;
        }
    
        echo 'Found, name: ', $headend->name, "\n";
        echo "==========================================\n";
        $headend->asXML('php://stdout');
    }
    

    This does literally what you're trying to achieve: it iterates over the large document (which is memory-friendly) until it finds the element(s) you're interested in. You then work on that concrete element and its XML only; XMLReader::readOuterXml() is a fine tool here.

    Example output:

    Found, name: Grande Gables at The Terrace
    ==========================================
    <?xml version="1.0"?>
    <headend headendId="TX02217">
            <name>Grande Gables at The Terrace</name>
            <mso msoId="17541">Grande Communications</mso>
            <marketIds>
                <marketId type="DMA">635</marketId>
            </marketIds>
            <postalCodes>
                <postalCode>11111</postalCode>
                <postalCode>22222</postalCode>
                <postalCode>33333</postalCode>
                <postalCode>78746</postalCode>
            </postalCodes>
            <location>Austin</location>
            <lineup>
                <station prgSvcId="20014">
                    <chan effDate="2006-01-16" tier="1">002</chan>
                </station>
                <station prgSvcId="10722">
                    <chan effDate="2006-01-16" tier="1">003</chan>
                </station>
            </lineup>
            <areasServed>
                <area>
                    <community>Thorndale</community>
                    <county code="45331" size="D">Milam</county>
                    <state>TX</state>
                </area>
                <area>
                    <community>Thrall</community>
                    <county code="45491" size="B">Williamson</county>
                    <state>TX</state>
                </area>
            </areasServed>
        </headend>
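
    If pulling in the gist is not an option, roughly the same flow can be sketched with plain XMLReader calls. This is an assumption-laden variant rather than the iterator approach above: it uses XMLReader::next() to jump between sibling <headend> elements, and the first headend (TX00001, "Example One") is made-up sample data:

```php
<?php
// Made-up sample input standing in for the large file.
$xml = <<<XML
<lineups country="USA">
  <headend headendId="TX00001">
    <name>Example One</name>
    <postalCodes><postalCode>11111</postalCode></postalCodes>
  </headend>
  <headend headendId="TX02217">
    <name>Grande Gables at The Terrace</name>
    <postalCodes>
      <postalCode>22222</postalCode>
      <postalCode>78746</postalCode>
    </postalCodes>
  </headend>
</lineups>
XML;

$pc = '78746';

$reader = new XMLReader();
$reader->XML($xml); // for a real file: $reader->open($path)

// Advance to the first <headend> element.
$onHeadend = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'headend') {
        $onHeadend = true;
        break;
    }
}

$matches = [];
while ($onHeadend) {
    // Only this one element is expanded into a SimpleXMLElement.
    $headend = simplexml_load_string($reader->readOuterXml());
    if ($headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        $matches[] = (string) $headend->name;
    }
    // Jump to the next sibling <headend>, skipping the current subtree.
    $onHeadend = $reader->next('headend');
}
$reader->close();

echo implode("\n", $matches), "\n";
```

    Skipping with next() avoids walking every descendant node of a headend that has already been inspected, at the cost of assuming all <headend> elements are siblings.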
    
  • 2020-12-21 03:00

    Edit: Oh you want to return the parent chunk? One moment.

    Here's an example that uses DOMXPath to find the postalCode element matching a given code and print it.

    http://codepad.org/kHss4MdV

    <?php
    
    $string='<lineups country="USA">
     <headend headendId="TX02217">
      <name>Grande Gables at The Terrace</name>
      <mso msoId="17541">Grande Communications</mso>
      <marketIds>
       <marketId type="DMA">635</marketId>
      </marketIds>
      <postalCodes>
       <postalCode>11111</postalCode>
       <postalCode>22222</postalCode>
       <postalCode>33333</postalCode>
       <postalCode>78746</postalCode>
      </postalCodes>
      <location>Austin</location>
      <lineup>
       <station prgSvcId="20014">
        <chan effDate="2006-01-16" tier="1">002</chan>
       </station>
       <station prgSvcId="10722">
        <chan effDate="2006-01-16" tier="1">003</chan>
       </station>
      </lineup>
      <areasServed>
       <area>
        <community>Thorndale</community>
        <county code="45331" size="D">Milam</county>
        <state>TX</state>
       </area>
       <area>
        <community>Thrall</community>
        <county code="45491" size="B">Williamson</county>
        <state>TX</state>
       </area>
      </areasServed>
     </headend></lineups>';
    
    $dom = new DOMDocument();
    $dom->loadXML($string);
    
    $xpath = new DOMXPath($dom);
    $elements= $xpath->query('//lineups/headend/postalCodes/*[text()=78746]');
    
    // DOMXPath::query() returns false on failure, never null.
    if ($elements !== false) {
      foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";
    
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
          echo $node->nodeValue. "\n";
        }
      }
    }
    

    Outputs:

    <br/>[postalCode]78746
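
    To return the parent chunk mentioned in the edit note, one possible sketch moves the condition into a predicate so that the <headend> element itself is selected. The trimmed sample XML and the second headend (TX00001) are assumptions for illustration. Note that DOMDocument loads the whole document into memory, so this only suits files far smaller than the one in the question:

```php
<?php
// Trimmed, partly made-up sample input.
$xml = '<lineups country="USA">'
     . '<headend headendId="TX02217"><name>Grande Gables at The Terrace</name>'
     . '<postalCodes><postalCode>11111</postalCode><postalCode>78746</postalCode></postalCodes>'
     . '</headend>'
     . '<headend headendId="TX00001"><name>Example Other</name>'
     . '<postalCodes><postalCode>22222</postalCode></postalCodes>'
     . '</headend>'
     . '</lineups>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Select the <headend> parent rather than the matching <postalCode> child.
$headends = $xpath->query("/lineups/headend[postalCodes/postalCode = '78746']");

$ids = [];
if ($headends !== false) {
    foreach ($headends as $headend) {
        $ids[] = $headend->getAttribute('headendId');
        // Serialize just this element, not the whole document.
        echo $dom->saveXML($headend), "\n";
    }
}
```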
    