Parse XML with PHP and XMLReader


I've been trying to parse a very large XML file with PHP and XMLReader, but can't seem to get the results I am looking for. Basically, I'm searching a ton of information

2 Answers

谎友^

2020-12-21 02:56

To gain more flexibility with XMLReader, I normally write iterators that work on the XMLReader object and provide the steps I need.

These range from a simple iteration over all nodes to an iteration over elements, optionally restricted to a specific name. Let's call the last one XMLElementIterator; it takes the reader and the element name as parameters.
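To make the idea concrete, here is a minimal sketch of what such an element iterator could look like. It is not the code from the gist referenced in this answer: the class name SketchElementIterator and the sample XML are made up for illustration, and it assumes PHP 7.4 or later (typed properties).

```php
<?php
// Hypothetical, simplified element iterator; NOT the gist's XMLElementIterator.
class SketchElementIterator implements Iterator
{
    private XMLReader $reader;
    private ?string $name;
    private int $index = 0;
    private bool $valid = false;

    public function __construct(XMLReader $reader, ?string $name = null)
    {
        $this->reader = $reader;
        $this->name   = $name; // null means "any element"
    }

    // Advance the reader until it sits on a matching element start tag.
    private function seek(): void
    {
        while ($this->valid = $this->reader->read()) {
            if ($this->reader->nodeType === XMLReader::ELEMENT
                && ($this->name === null || $this->reader->name === $this->name)) {
                return;
            }
        }
    }

    // XMLReader is forward-only, so "rewind" just moves to the first match.
    public function rewind(): void { $this->index = 0; $this->seek(); }
    public function valid(): bool { return $this->valid; }
    public function current(): XMLReader { return $this->reader; }
    public function key(): int { return $this->index; }
    public function next(): void { $this->index++; $this->seek(); }
}

// Usage on a tiny made-up document: collect the text of every <b> element.
$reader = new XMLReader();
$reader->XML('<a><b>1</b><c/><b>2</b></a>');

$values = [];
foreach (new SketchElementIterator($reader, 'b') as $node) {
    $values[] = $node->readString();
}
$reader->close();
```

Because the underlying reader only moves forward, an iterator like this can be traversed once per opened document; a subclass can override current() to return something more convenient than the raw reader.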

In your scenario, I would then create an iterator that returns a SimpleXMLElement for the current element, taking only the <headend> elements:

require('xmlreader-iterators.php'); // https://gist.github.com/hakre/5147685

class HeadendIterator extends XMLElementIterator {
    const ELEMENT_NAME = 'headend';

    public function __construct(XMLReader $reader) {
        parent::__construct($reader, self::ELEMENT_NAME);
    }

    /**
     * @return SimpleXMLElement
     */
    public function current() {
        return simplexml_load_string($this->reader->readOuterXml());
    }
}

Equipped with this iterator, the rest of your job is mostly a piece of cake. First open the 10 gigabyte file (XMLReader streams it instead of loading it into memory):

$pc      = "78746";

$xmlfile = '../data/lineups.xml';
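$reader  = new XMLReader();
// open() returns false on failure, so stop early instead of iterating a dead reader
if (!$reader->open($xmlfile)) {
    die("Unable to open $xmlfile\n");
}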

Then check whether each <headend> element contains the postal code and, if so, display its data / XML:

foreach (new HeadendIterator($reader) as $headend) {
    /* @var $headend SimpleXMLElement */
    if (!$headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        continue;
    }

    echo 'Found, name: ', $headend->name, "\n";
    echo "==========================================\n";
    $headend->asXML('php://stdout');
}

This does literally what you're trying to achieve: it iterates over the large document (which is memory-friendly) until it finds the element(s) you're interested in. You then work on that concrete element and its XML only; XMLReader::readOuterXml() is a fine tool here.
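If you would rather not pull in the iterator library, the same streaming search can be sketched with core XMLReader calls only. The inline sample document, the headendId values "A" and "B" and the names "One" and "Two" are made up for illustration; for the real file you would call open() on its path instead of XML().

```php
<?php
// Sketch: find <headend> elements by postal code using only core XMLReader calls.
// $xml is a tiny made-up sample standing in for the large lineups file.
$xml = '<lineups>'
     . '<headend headendId="A"><name>One</name>'
     . '<postalCodes><postalCode>11111</postalCode></postalCodes></headend>'
     . '<headend headendId="B"><name>Two</name>'
     . '<postalCodes><postalCode>78746</postalCode></postalCodes></headend>'
     . '</lineups>';
$pc = '78746';

$reader = new XMLReader();
$reader->XML($xml); // for a real file: $reader->open($path)

// Move forward to the first <headend> start tag.
$positioned = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'headend') {
        $positioned = true;
        break;
    }
}

$found = [];
while ($positioned) {
    // Only the current <headend> subtree is turned into a SimpleXMLElement.
    $headend = simplexml_load_string($reader->readOuterXml());
    if ($headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        $found[] = (string) $headend->name;
    }
    // Skip the rest of this subtree and jump to the next <headend>, if any.
    $positioned = $reader->next('headend');
}
$reader->close();
```

Using next('headend') rather than read() skips over the children of each element that has already been inspected, so the reader never walks the same subtree twice.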

Example output:

Found, name: Grande Gables at The Terrace
==========================================
<?xml version="1.0"?>
<headend headendId="TX02217">
        <name>Grande Gables at The Terrace</name>
        <mso msoId="17541">Grande Communications</mso>
        <marketIds>
            <marketId type="DMA">635</marketId>
        </marketIds>
        <postalCodes>
            <postalCode>11111</postalCode>
            <postalCode>22222</postalCode>
            <postalCode>33333</postalCode>
            <postalCode>78746</postalCode>
        </postalCodes>
        <location>Austin</location>
        <lineup>
            <station prgSvcId="20014">
                <chan effDate="2006-01-16" tier="1">002</chan>
            </station>
            <station prgSvcId="10722">
                <chan effDate="2006-01-16" tier="1">003</chan>
            </station>
        </lineup>
        <areasServed>
            <area>
                <community>Thorndale</community>
                <county code="45331" size="D">Milam</county>
                <state>TX</state>
            </area>
            <area>
                <community>Thrall</community>
                <county code="45491" size="B">Williamson</county>
                <state>TX</state>
            </area>
        </areasServed>
    </headend>


独厮守ぢ

2020-12-21 03:00

Edit: Oh you want to return the parent chunk? One moment.

Here's an example that uses DOMDocument and DOMXPath to pull out the postalCode elements matching a given value.

http://codepad.org/kHss4MdV

<?php

$string='<lineups country="USA">
 <headend headendId="TX02217">
  <name>Grande Gables at The Terrace</name>
  <mso msoId="17541">Grande Communications</mso>
  <marketIds>
   <marketId type="DMA">635</marketId>
  </marketIds>
  <postalCodes>
   <postalCode>11111</postalCode>
   <postalCode>22222</postalCode>
   <postalCode>33333</postalCode>
   <postalCode>78746</postalCode>
  </postalCodes>
  <location>Austin</location>
  <lineup>
   <station prgSvcId="20014">
    <chan effDate="2006-01-16" tier="1">002</chan>
   </station>
   <station prgSvcId="10722">
    <chan effDate="2006-01-16" tier="1">003</chan>
   </station>
  </lineup>
  <areasServed>
   <area>
    <community>Thorndale</community>
    <county code="45331" size="D">Milam</county>
    <state>TX</state>
   </area>
   <area>
    <community>Thrall</community>
    <county code="45491" size="B">Williamson</county>
    <state>TX</state>
   </area>
  </areasServed>
 </headend></lineups>';

$dom = new DOMDocument();
$dom->loadXML($string);

$xpath = new DOMXPath($dom);
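// Compare as a string so that postal codes with leading zeros match exactly
$elements = $xpath->query("//lineups/headend/postalCodes/*[text()='78746']");

// query() returns false (not null) when the expression is invalid
if ($elements !== false) {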
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }
  }
}

Outputs:

<br/>[postalCode]78746
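To return the parent chunk instead, as the edit note at the top of this answer mentions, one option is to put the postal code test into a predicate on headend. The following is only a sketch: the second headend (id "XX00000", name "Other") is made up so the filter has something to exclude. Like the example above, it loads the whole document into memory, so it is not suitable for a multi-gigabyte file.

```php
<?php
// Sketch: select the whole <headend> element whose postalCodes contain a value.
// The second <headend> is made-up sample data.
$xml = '<lineups country="USA">'
     . '<headend headendId="TX02217"><name>Grande Gables at The Terrace</name>'
     . '<postalCodes><postalCode>11111</postalCode><postalCode>78746</postalCode></postalCodes>'
     . '</headend>'
     . '<headend headendId="XX00000"><name>Other</name>'
     . '<postalCodes><postalCode>99999</postalCode></postalCodes>'
     . '</headend>'
     . '</lineups>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath   = new DOMXPath($dom);
// The predicate keeps only headend elements that have a matching postalCode child.
$matches = $xpath->query("/lineups/headend[postalCodes/postalCode = '78746']");

$ids = [];
foreach ($matches as $headend) {
    $ids[] = $headend->getAttribute('headendId');
    // saveXML() with a node argument serializes just that subtree.
    echo $dom->saveXML($headend), "\n";
}
```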
