How to parse text and image from complex XML

Question

I hope you can help me with this. The XML file looks like this:

<channel><item>
<description>
<div>  <a href="http://image.com">
<span>   
<img src="http://image.com" /> 
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc... 
</div>
</description>
</item></channel>

I can get the contents of the description tag, but when I do that, I get the whole structure, which has lots of CSS in it, and I don't want that. What I really need is to parse out only the href link and the Lorem Ipsum text. I'm trying with SimpleXML, but I can't figure it out; it looks too complicated. Any ideas?

Edit: the code I use to parse the XML:

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

Answer 1:

That XML looks very much like an RSS or Atom feed (or an extract from one). The description node would commonly be escaped, or placed inside a section marked <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text, even if they contain <, >, or &.

Your sample doesn't indicate that, but if your echo is giving you the whole content, including img tags etc., then that is what is happening, and your question is similar to "Trying to Parse Only the Images from an RSS Feed": you need to grab the whole description content and parse it as a document of its own.
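One way to tell the two cases apart is to check whether the description element has any element children. The following is only a sketch: the two feed strings and the helper name descriptionIsRawText are made up for illustration, and it assumes the feed has an rss root element.

```php
<?php
// Hypothetical feed snippets: the first wraps the HTML in CDATA, the second nests it as real XML nodes.
$escapedFeed = '<rss><channel><item><description><![CDATA[<div><a href="http://image.com"><img src="http://image.com" /></a>Lorem Ipsum</div>]]></description></item></channel></rss>';
$nestedFeed  = '<rss><channel><item><description><div><a href="http://image.com"><img src="http://image.com" /></a>Lorem Ipsum</div></description></item></channel></rss>';

// Hypothetical helper: true when <description> holds only text (escaped or CDATA)
// and therefore has no element children to walk with SimpleXML.
function descriptionIsRawText(SimpleXMLElement $description): bool
{
    return count($description->children()) === 0;
}

$escaped = new SimpleXMLElement($escapedFeed);
$nested  = new SimpleXMLElement($nestedFeed);

$escapedIsRaw = descriptionIsRawText($escaped->channel->item->description);
$nestedIsRaw  = descriptionIsRawText($nested->channel->item->description);

var_dump($escapedIsRaw); // the CDATA version has no child elements
var_dump($nestedIsRaw);  // the nested version has a <div> child
```

When the check returns true, the string value of the description is itself HTML and has to be loaded into a second parser before the link and text can be picked out.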

If for some reason the HTML is not being escaped, and is actually being included as a bunch of child nodes inside the XML, then the linked URL can be accessed directly (assuming the structure is always consistent):

echo (string)$post->description->div->a['href'];

As for the text, SimpleXML will concatenate all text content of a particular element (but not from within its children) if you "cast to string" with (string) (echo automatically casts to string, but I'm guessing you'll want to do something other than echo with it eventually).

In your example, the text you want is inside the first (and only) div, so this would display it:

echo (string)$post->description->div;
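Putting the two casts together on the sample from the question might look like the sketch below. It assumes the HTML really is nested as XML nodes and that channel is the root element, as in the extract; with a full feed the loop would start from $file->channel->item instead. The trim() call is an addition to drop the whitespace that surrounds the text.

```php
<?php
// The sample from the question, with <channel> as the root element.
$xml = <<<XML
<channel><item>
<description>
<div>  <a href="http://image.com">
<span>
<img src="http://image.com" />
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>
</description>
</item></channel>
XML;

$channel = new SimpleXMLElement($xml);

foreach ($channel->item as $post) {
    // Attribute access returns an object, so cast it to get a plain string.
    $href = (string) $post->description->div->a['href'];

    // Casting the <div> joins only its own text nodes; text inside <a> or <span> is skipped.
    $text = trim((string) $post->description->div);

    echo $href, "\n", $text, "\n";
}
```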

However, you mention "lots of CSS", which I guess you've left out of your example for simplicity, so I'm not sure how consistent your real content is.

Answer 2:

That's going to be complicated. ~~You don't have XML there but html. One difference is that a tag can't contain another tag AND some text in XML. That's why~~ I'd use PHP's DOM extension (which I haven't used yet, but it is similar to the DOM in plain JavaScript).

This is what I have hacked together (untested):

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// this tries to get an a element which has a href and returns that href
function getAHref ( $doc ) {
    // now get all a elements to get the one with a href
    $aElements = $doc->getElementsByTagName( "a" );
    foreach ( $aElements as $a ) {
        // does this element have a href? then return it
        if ( $a->hasAttribute( "href" ) ) {
            return $a->getAttribute( "href" );
        }
    }
    // failed? return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in an element of its own, so this is going to be difficult
function getTextFromNode ( $doc ) {
    // get and loop all divs (assuming the text is always a child of a div)
    $divs = $doc->getElementsByTagName( "div" ); // do we know it's always in that div?
    foreach ( $divs as $div ) {
        // also loop all child nodes to get the text nodes
        foreach ( $div->childNodes as $child ) {
            // is this a text node?
            if ( $child->nodeType == XML_TEXT_NODE ) {
                // is there something in it (new lines count as text nodes)
                if ( trim( $child->nodeValue ) != "" ) {
                    // *pfew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // failed? return false
    return false;
}
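The same two lookups can also be written with DOMXPath instead of manual loops. This is a swapped-in technique rather than what the answer above does, and it is only a sketch: the $html string is the description content from the question, and the two XPath queries assume the text is always a direct child of a div.

```php
<?php
// Sample markup taken from the question's <description> content.
$html = '<div>  <a href="http://image.com"><span><img src="http://image.com" /></span></a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First <a> element that carries an href attribute, or null if there is none.
$link = $xpath->query('//a[@href]')->item(0);
$href = $link ? $link->getAttribute('href') : null;

// First non-blank text node that is a direct child of a <div>.
$textNode = $xpath->query('//div/text()[normalize-space()]')->item(0);
$text = $textNode ? trim($textNode->nodeValue) : null;

echo $href, "\n", $text, "\n";
```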

Answer 3:

This is the final code that answers the question.

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ($descriptions as $description_node) {
    // parse the description content as an HTML document of its own
    $description_dom = new DOMDocument();
    $description_dom->loadHTML((string)$description_node);

    $description_sxml = simplexml_import_dom($description_dom);

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }

    foreach ($text as $t) {
        echo (string)$t;
    }
}

This is IMSoP's code; I added $text = $description_sxml->xpath('//div'); to read the text that is inside the <div>.

In my case, some of the posts in the XML have multiple <div> and <span> tags, so to parse all of them I might have to add another ->xpath call for the <span>, or an if/else statement so that when the <div> has no content, the <span> contents are echoed instead. Thanks for your replies.
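The div-then-span fallback described above could be sketched roughly as follows. The sample strings and the function name extractText are invented for illustration, and the sketch uses DOMXPath with textContent, which (unlike a SimpleXML string cast) also includes text from nested elements.

```php
<?php
// Hypothetical description bodies: one with text in a <div>, one with text only in a <span>.
$descriptions = [
    '<div><a href="http://image.com"><img src="http://image.com" /></a>Text in a div</div>',
    '<div><a href="http://image.com"><img src="http://image.com" /></a></div><span>Text in a span</span>',
];

// Hypothetical helper: returns the trimmed text of the first <div> that has any,
// otherwise of the first <span> that has any, otherwise an empty string.
function extractText(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach (['//div', '//span'] as $query) {
        foreach ($xpath->query($query) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return '';
}

$results = array_map('extractText', $descriptions);
print_r($results);
```

Trying the queries in a fixed order keeps the preference for div content explicit, and further element names can be added to the list if the real feed needs them.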

Source: https://stackoverflow.com/questions/14299468/how-to-parse-text-and-image-from-complex-xml

Tags

php

xml

simplexml