Stream parse 4 GB XML file in PHP

梦想与她 submitted on 2019-11-29 07:38:44

Here's a college try. This assumes a file is being used, and that you want to write to a file:

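<?php

$interestingNodes = array('title', 'url', 'abstract');
$xmlObject = new XMLReader();
if (!$xmlObject->open('bigolfile.xml')) {
    die('could not open XML input');
}

$xmlOutput = new XMLWriter();
$xmlOutput->openURI('destfile.xml');
$xmlOutput->setIndent(true);
$xmlOutput->setIndentString("   ");
$xmlOutput->startDocument('1.0', 'UTF-8');
// The output needs a single root element; the name 'docs' is an assumption
$xmlOutput->startElement('docs');

$inDoc = false;
while ($xmlObject->read()) {
    $isElement = ($xmlObject->nodeType === XMLReader::ELEMENT);
    if ($isElement && $xmlObject->name == 'doc') {
        $xmlOutput->startElement('doc');
        if ($xmlObject->isEmptyElement) {
            $xmlOutput->endElement(); // <doc/> has no separate end tag
        } else {
            $inDoc = true;
        }
    } elseif ($isElement && $inDoc && in_array($xmlObject->name, $interestingNodes, true)) {
        // in_array() is used because array_search() returns index 0 (falsy) for 'title'.
        // readString() returns the text content of the current element.
        $xmlOutput->writeElement($xmlObject->name, $xmlObject->readString());
    } elseif ($xmlObject->nodeType === XMLReader::END_ELEMENT && $xmlObject->name == 'doc') {
        $xmlOutput->endElement(); // close the doc node
        $xmlOutput->flush();      // write out what has been buffered so far
        $inDoc = false;
    }
}

$xmlObject->close();
$xmlOutput->endElement(); // close the root node
$xmlOutput->endDocument();
$xmlOutput->flush();

?>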
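Before pointing this at the 4 GB file, it may help to check the filtering logic on a small in-memory sample. The sketch below is not part of the original answer: the helper name `filterDocs`, the root element name `docs`, and the sample document are all made up for illustration. It uses the same XMLReader/XMLWriter calls, only with `XML()` and `openMemory()` in place of files.

```php
<?php
// Hypothetical helper (not from the original answer): copies only the
// wanted child elements of each <doc> into a new in-memory document.
function filterDocs(string $xml, array $keep): string
{
    $reader = new XMLReader();
    $reader->XML($xml);

    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->startDocument('1.0', 'UTF-8');
    $writer->startElement('docs'); // assumed root element name

    $inDoc = false;
    while ($reader->read()) {
        $isElement = ($reader->nodeType === XMLReader::ELEMENT);
        if ($isElement && $reader->name === 'doc') {
            $writer->startElement('doc');
            if ($reader->isEmptyElement) {
                $writer->endElement(); // <doc/> has no separate end tag
            } else {
                $inDoc = true;
            }
        } elseif ($isElement && $inDoc && in_array($reader->name, $keep, true)) {
            // readString() gives the element's text; writeElement() re-escapes it
            $writer->writeElement($reader->name, $reader->readString());
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT && $reader->name === 'doc') {
            $writer->endElement();
            $inDoc = false;
        }
    }

    $reader->close();
    $writer->endElement(); // close the root
    $writer->endDocument();
    return $writer->outputMemory();
}

// Made-up sample: <id> is not in the keep list, so it should be dropped
$sample = '<feed><doc><title>A &amp; B</title><id>7</id>'
        . '<url>http://example.com/a</url></doc></feed>';
echo filterDocs($sample, array('title', 'url', 'abstract'));
```

If the sample comes out with `title` and `url` but without `id`, the same loop should behave the same way on the real file, since XMLReader only ever holds the current node in memory.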
higuaro

In this scenario a DOM parser is not an option. As you stated, the document will not fit in memory because of the file size, and even if it did, it would be slow: the parser first loads the entire file, and only then can you iterate through it. Try a SAX parser (event/stream oriented) instead: add handlers for the tags you're interested in (doc, title, url, abstract) and, for every event, append the node found to the new XML file.

Here you have more information:

What is the fastest XML parser in PHP?

Here is a (not tested) sample of what the code would be:

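<?php
    $file = "bigfile.xml";
    // 'w' truncates earlier output; 'a' would append to a previous run
    $fh = fopen("out.xml", 'w') or die("can't open file");
    $currentNodeTag = "";
    $tags = array("doc", "title", "url", "abstract");

    function startElement($parser, $name, $attrs) {
        global $tags, $fh, $currentNodeTag;

        // Case folding is on by default, so names arrive upper-cased
        $name = strtolower($name);
        if (in_array($name, $tags, true)) {
            // Only the leaf tags collect text; <doc> is just a container
            $currentNodeTag = ($name === "doc") ? "" : $name;
            fwrite($fh, sprintf("<%s>%s", $name, $name === "doc" ? "\n" : ""));
        }
    }

    function endElement($parser, $name) {
        global $tags, $fh, $currentNodeTag;

        $name = strtolower($name);
        if (in_array($name, $tags, true)) {
            fwrite($fh, sprintf("</%s>\n", $name));
            $currentNodeTag = "";
        }
    }

    function characterData($parser, $data) {
        global $fh, $currentNodeTag;

        if ($currentNodeTag !== "") {
            // The parser hands over decoded text, so &, < and > must be re-escaped
            fwrite($fh, htmlspecialchars($data, ENT_XML1 | ENT_NOQUOTES, 'UTF-8'));
        }
    }

    $xmlParser = xml_parser_create();
    xml_set_element_handler($xmlParser, "startElement", "endElement");
    xml_set_character_data_handler($xmlParser, "characterData");

    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }

    // The output needs a single root element; the name 'docs' is an assumption
    fwrite($fh, "<docs>\n");

    // Loop on feof() so the parser always receives a final chunk, even an empty one
    while (!feof($fp)) {
        $data = fread($fp, 4096);
        if ($data === false) {
            die("could not read XML input");
        }
        if (!xml_parse($xmlParser, $data, feof($fp))) {
            die(sprintf("XML error: %s at line %d",
                        xml_error_string(xml_get_error_code($xmlParser)),
                        xml_get_current_line_number($xmlParser)));
        }
    }

    fwrite($fh, "</docs>\n");
    xml_parser_free($xmlParser);
    fclose($fp);
    fclose($fh);
?>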
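The handlers above share state through globals. As a possible variant, the sketch below passes closures to the same `xml_set_element_handler` and `xml_set_character_data_handler` functions and keeps the state in local variables. The helper name `collectText`, its parameters, and the sample document are hypothetical. It feeds the parser in deliberately tiny chunks to illustrate that a chunk boundary may fall anywhere, including in the middle of a tag, which is what makes reading a large file 4096 bytes at a time safe.

```php
<?php
// Hypothetical helper (not from the original answer): collects the text of
// the wanted tags while feeding the parser one small chunk at a time.
function collectText(string $xml, array $wanted, int $chunkSize = 5): array
{
    $found = array();
    $current = null; // index into $found of the element being read, if any

    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$current, &$found, $wanted) {
            if (in_array($name, $wanted, true)) {
                $current = count($found);
                $found[$current] = array('tag' => $name, 'text' => '');
            }
        },
        function ($p, $name) use (&$current, $wanted) {
            if (in_array($name, $wanted, true)) {
                $current = null;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$current, &$found) {
            if ($current !== null) {
                // Text can arrive in several pieces, so append rather than assign
                $found[$current]['text'] .= $data;
            }
        }
    );

    $length = strlen($xml);
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $isFinal = ($offset + $chunkSize >= $length);
        if (!xml_parse($parser, substr($xml, $offset, $chunkSize), $isFinal)) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    return $found;
}

// Made-up sample document
$sample = '<feed><doc><title>Hello world</title>'
        . '<url>http://example.com</url></doc></feed>';
print_r(collectText($sample, array('title', 'url')));
```

Appending in the character-data handler matters: the parser is free to deliver one element's text in several calls, so assigning instead of appending could silently drop part of a value.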