I've been trying to parse a very large XML file with PHP and XMLReader, but can't seem to get the results I am looking for. Basically, I'm searching a ton of information
To gain more flexibility with XMLReader, I normally write iterators that work on the XMLReader object and provide the steps I need. That starts with a simple iteration over all nodes and goes up to an iteration over elements, optionally restricted to a specific name. Let's call the last one XMLElementIterator; it takes the reader and the element name as parameters.
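To make the idea of an element iterator concrete, here is a minimal sketch. It is not the implementation from the gist referenced in this answer; the class name SimpleElementIterator and the inline sample XML are assumptions made up for illustration.

```php
<?php
// Hypothetical stand-in for an element iterator (not the gist's code).
class SimpleElementIterator implements Iterator
{
    private XMLReader $reader;
    private string $name;
    private int $index = 0;
    private bool $valid = false;

    public function __construct(XMLReader $reader, string $name)
    {
        $this->reader = $reader;
        $this->name = $name;
    }

    // Read forward until the reader sits on a start tag with the wanted name.
    private function seek(): void
    {
        $this->valid = false;
        while ($this->reader->read()) {
            if ($this->reader->nodeType === XMLReader::ELEMENT
                && $this->reader->name === $this->name) {
                $this->valid = true;
                return;
            }
        }
    }

    // XMLReader is forward-only, so "rewind" only moves to the first match.
    public function rewind(): void
    {
        $this->index = 0;
        $this->seek();
    }

    public function valid(): bool
    {
        return $this->valid;
    }

    public function current(): XMLReader
    {
        return $this->reader;
    }

    public function key(): int
    {
        return $this->index;
    }

    public function next(): void
    {
        $this->index++;
        $this->seek();
    }
}

// Usage with a tiny made-up document instead of the real file.
$sampleReader = new XMLReader();
$sampleReader->XML('<lineups><headend headendId="A"/><other/><headend headendId="B"/></lineups>');

$ids = [];
foreach (new SimpleElementIterator($sampleReader, 'headend') as $element) {
    $ids[] = $element->getAttribute('headendId');
}
```

Because the reader only moves forward, such an iterator can be traversed once per opened document; a second foreach would need a freshly opened reader.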
In your scenario I would then create an iterator that returns a SimpleXMLElement for the current element, taking only the <headend> elements:
require('xmlreader-iterators.php'); // https://gist.github.com/hakre/5147685

class HeadendIterator extends XMLElementIterator {
    const ELEMENT_NAME = 'headend';

    public function __construct(XMLReader $reader) {
        parent::__construct($reader, self::ELEMENT_NAME);
    }

    /**
     * @return SimpleXMLElement
     */
    public function current() {
        return simplexml_load_string($this->reader->readOuterXml());
    }
}
Equipped with this iterator, the rest of your job is mostly a piece of cake. First open the 10 gigabyte file (XMLReader streams it, so it is not loaded into memory as a whole):
$pc = "78746";
$xmlfile = '../data/lineups.xml';
$reader = new XMLReader();
$reader->open($xmlfile);
And then check whether the <headend> element contains the information and, if so, display the data / XML:
foreach (new HeadendIterator($reader) as $headend) {
    /* @var $headend SimpleXMLElement */
    if (!$headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        continue;
    }
    echo 'Found, name: ', $headend->name, "\n";
    echo "==========================================\n";
    $headend->asXML('php://stdout');
}
This does literally what you're trying to achieve: iterate over the large document (which is memory-friendly) until you find the element(s) you're interested in. You then process only that concrete element and its XML; XMLReader::readOuterXml() is a fine tool here.
Example output:
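If you would rather not depend on the iterator library, the same streaming idea can be sketched with plain XMLReader calls. The function name findHeadendNames, the inline sample XML and the second headend ("Other Headend") are assumptions for illustration; with the real file you would call open() on the path instead of XML() on a string.

```php
<?php
// Hypothetical helper: collect the names of all <headend> elements that
// list the given postal code, reading the document as a stream.
function findHeadendNames(XMLReader $reader, string $postalCode): array
{
    $names = [];

    // Move forward to the first <headend> start tag.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'headend') {
            $found = true;
            break;
        }
    }

    while ($found) {
        // Only the current <headend> subtree is turned into a SimpleXMLElement.
        $headend = simplexml_load_string($reader->readOuterXml());
        if ($headend->xpath("/*/postalCodes/postalCode[. = '$postalCode']")) {
            $names[] = (string) $headend->name;
        }
        // Skip the rest of this subtree and jump to the next <headend>.
        $found = $reader->next('headend');
    }

    return $names;
}

// Made-up sample data standing in for the large file.
$sampleXml = '<lineups country="USA">'
    . '<headend headendId="TX02217"><name>Grande Gables at The Terrace</name>'
    . '<postalCodes><postalCode>33333</postalCode><postalCode>78746</postalCode></postalCodes></headend>'
    . '<headend headendId="XX00000"><name>Other Headend</name>'
    . '<postalCodes><postalCode>11111</postalCode></postalCodes></headend>'
    . '</lineups>';

$sampleReader = new XMLReader();
$sampleReader->XML($sampleXml);
$names = findHeadendNames($sampleReader, '78746');
```

Using next('headend') instead of read() skips the children of each <headend> once it has been handled, so the reader does not walk through the subtree a second time.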
Found, name: Grande Gables at The Terrace
==========================================
<?xml version="1.0"?>
<headend headendId="TX02217">
<name>Grande Gables at The Terrace</name>
<mso msoId="17541">Grande Communications</mso>
<marketIds>
<marketId type="DMA">635</marketId>
</marketIds>
<postalCodes>
<postalCode>11111</postalCode>
<postalCode>22222</postalCode>
<postalCode>33333</postalCode>
<postalCode>78746</postalCode>
</postalCodes>
<location>Austin</location>
<lineup>
<station prgSvcId="20014">
<chan effDate="2006-01-16" tier="1">002</chan>
</station>
<station prgSvcId="10722">
<chan effDate="2006-01-16" tier="1">003</chan>
</station>
</lineup>
<areasServed>
<area>
<community>Thorndale</community>
<county code="45331" size="D">Milam</county>
<state>TX</state>
</area>
<area>
<community>Thrall</community>
<county code="45491" size="B">Williamson</county>
<state>TX</state>
</area>
</areasServed>
</headend>
Edit: Oh you want to return the parent chunk? One moment.
Here's an example that finds the matching postalCode elements with DOMXPath and prints them.
http://codepad.org/kHss4MdV
<?php
$string='<lineups country="USA">
<headend headendId="TX02217">
<name>Grande Gables at The Terrace</name>
<mso msoId="17541">Grande Communications</mso>
<marketIds>
<marketId type="DMA">635</marketId>
</marketIds>
<postalCodes>
<postalCode>11111</postalCode>
<postalCode>22222</postalCode>
<postalCode>33333</postalCode>
<postalCode>78746</postalCode>
</postalCodes>
<location>Austin</location>
<lineup>
<station prgSvcId="20014">
<chan effDate="2006-01-16" tier="1">002</chan>
</station>
<station prgSvcId="10722">
<chan effDate="2006-01-16" tier="1">003</chan>
</station>
</lineup>
<areasServed>
<area>
<community>Thorndale</community>
<county code="45331" size="D">Milam</county>
<state>TX</state>
</area>
<area>
<community>Thrall</community>
<county code="45491" size="B">Williamson</county>
<state>TX</state>
</area>
</areasServed>
</headend></lineups>';
$dom = new DOMDocument();
$dom->loadXML($string);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//lineups/headend/postalCodes/*[text()=78746]');
// DOMXPath::query() returns false (not null) if the expression is invalid.
if ($elements !== false) {
    foreach ($elements as $element) {
        echo "<br/>[". $element->nodeName. "]";
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->nodeValue. "\n";
        }
    }
}
Outputs:
<br/>[postalCode]78746
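To return the parent chunk rather than the postalCode itself, the XPath expression can select the <headend> element and use the postal code as a predicate. The following is a minimal sketch under assumptions: the shortened sample XML and the second headend (headendId "XX00000") are made up for illustration, and because DOMDocument loads the whole document into memory, this is only suitable for small files, not the 10 gigabyte one.

```php
<?php
// Shortened, made-up sample data (assumption for illustration).
$sampleXml = '<lineups country="USA">'
    . '<headend headendId="TX02217"><name>Grande Gables at The Terrace</name>'
    . '<postalCodes><postalCode>33333</postalCode><postalCode>78746</postalCode></postalCodes></headend>'
    . '<headend headendId="XX00000"><name>Other Headend</name>'
    . '<postalCodes><postalCode>11111</postalCode></postalCodes></headend>'
    . '</lineups>';

$dom = new DOMDocument();
$dom->loadXML($sampleXml);
$xpath = new DOMXPath($dom);

// Select every <headend> that has a matching <postalCode> child.
$headends = $xpath->query("//headend[postalCodes/postalCode = '78746']");

$chunks = [];
if ($headends !== false) {
    foreach ($headends as $headend) {
        // saveXML($node) serializes just that element and its children.
        $chunks[$headend->getAttribute('headendId')] = $dom->saveXML($headend);
    }
}
```

Each entry in $chunks then holds the complete <headend> element as an XML string, keyed by its headendId attribute.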