How to parse <media:content> tag in RSS with simplexml

Submitted by 别等时光非礼了梦想 on 2021-02-07 18:14:08

Question


The structure of my RSS feed from http://rss.cnn.com/rss/edition.rss is:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://rss.cnn.com/~d/styles/itemcontent.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title><![CDATA[CNN.com - RSS Channel - Intl Homepage - News]]></title>
    <description><![CDATA[CNN.com delivers up-to-the-minute news and information on the latest top stories, weather, entertainment, politics and more.]]></description>
    <link>http://www.cnn.com/intl_index.html</link>
    ...

    <item>
      <title><![CDATA[Russia responds to claims it has damaging material on Trump]]></title>
      <description><![CDATA[The Kremlin denied it has compromising information about US President-elect Donald Trump, describing the allegations as "pulp fiction".]]></description>
      <link>http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</link>
      <guid isPermaLink="true">http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</guid>
      <pubDate>Wed, 11 Jan 2017 14:44:49 GMT</pubDate>
      <media:group>
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-large-11.jpg" height="300" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-large-gallery.jpg" height="552" width="414" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-video-synd-2.jpg" height="480" width="640" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-live-video.jpg" height="324" width="576" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-gallery.jpg" height="360" width="270" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-story-body.jpg" height="169" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-assign.jpg" height="186" width="248" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-hp-video.jpg" height="144" width="256" />
      </media:group>
    </item>
    ...

  </channel>
</rss>

If you parse this XML with simplexml like this:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

you will see that the <media:content> tags are simply missing from the items in $rssarray. I found a tutorial that solves this with namespaces. However, in its example the author iterates over the SimpleXML objects directly:

foreach ($xml->channel->item as $item) { ... }

but I am using the following (I cannot use that foreach, for several reasons):

$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);

So I modified the solution for my case like this:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
  $namespaces = $rss->getNamespaces(true); // get namespaces

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

  if (isset($rssarray['channel']['item'])) {
    foreach ($rssarray['channel']['item'] as $key => $item) {

      $media_content = $rss->channel->item[$key]->children($namespaces['media']);
      foreach($media_content as $tag) {

        $tagjson = json_encode($tag);
        $tagarray = json_decode($tagjson, TRUE);

      }

    }
  }

But it does not work. For every item, $tagarray ends up as an array with this structure:

Array(
  'content' => array(
     '0' => array(null),
     '1' => array(null),
     ...
     '11' => array(null),
   )
)

It is an array with as many entries as there are <media:content> tags, but every entry is empty. I need to get the url attribute of each one. What am I doing wrong, and why am I getting empty arrays?


Answer 1:


Tags are actually empty:

<media:content ... />
                   ^^

Information is contained in attributes, which can be fetched with SimpleXMLElement::attributes(), e.g.:

$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$namespaces = $rss->getNamespaces(true);
$media_content = $rss->channel->item[0]->children($namespaces['media']);
foreach($media_content->group->content as $i){
    var_dump((string)$i->attributes()->url);
}
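
The snippet above reads only the first item. As a minimal sketch of the same idea applied to every item, the helper below collects the url attributes for one <item>, whether the <media:content> tags sit inside <media:group> (as in the CNN feed) or directly under <item>. The function name mediaUrls, the MEDIA_NS constant, and the example.com sample feed are assumptions made for illustration; they are not part of the original feed or answer.

```php
<?php
// Assumed constant: the Media RSS namespace URI declared on the <rss> element.
const MEDIA_NS = 'http://search.yahoo.com/mrss/';

/**
 * Hypothetical helper: collect the url attribute of every <media:content>
 * under one <item>, with or without a wrapping <media:group>.
 */
function mediaUrls(SimpleXMLElement $item): array
{
    $urls = [];
    $media = $item->children(MEDIA_NS);

    // <media:content> placed directly under <item>
    foreach ($media->content as $content) {
        $urls[] = (string) $content->attributes()->url;
    }

    // <media:content> wrapped in <media:group>, as in the CNN feed
    foreach ($media->group as $group) {
        foreach ($group->children(MEDIA_NS)->content as $content) {
            $urls[] = (string) $content->attributes()->url;
        }
    }

    return $urls;
}

// Made-up sample feed so the sketch runs without network access.
$xml = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>Grouped</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://example.com/b.jpg" height="300" width="300" />
      </media:group>
    </item>
    <item>
      <title>Direct</title>
      <media:content medium="image" url="http://example.com/c.jpg" />
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);
foreach ($rss->channel->item as $item) {
    echo $item->title, ': ', implode(', ', mediaUrls($item)), PHP_EOL;
}
```

Passing the namespace URI to children() instead of looking it up through getNamespaces() avoids an undefined index when a feed does not declare the media prefix; in that case the helper simply returns an empty array.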

I suspect the problem comes from the JSON trick. SimpleXML generates all its classes and properties dynamically (they aren't regular PHP classes), which means that you can't fully rely on standard PHP features like print_r() or json_encode(). You can see this if you insert the following into the loop above:

var_dump($i, json_encode($i), (string)$i->attributes()->url);
object(SimpleXMLElement)#2 (0) {
}
string(2) "{}"
string(91) "http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg"
...
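
If a plain array is still what you need, one option is to build it by hand instead of going through json_encode(). The following is only a sketch under assumptions: the function name feedToArray, the chosen array keys, and the example.com sample feed are made up for illustration, and it handles only the <media:group> layout shown in the question.

```php
<?php
/**
 * Hypothetical helper: turn a feed into a plain PHP array without the
 * json_encode()/json_decode() round trip.
 */
function feedToArray(SimpleXMLElement $rss): array
{
    $mediaNs = 'http://search.yahoo.com/mrss/'; // Media RSS namespace URI
    $items = [];

    foreach ($rss->channel->item as $item) {
        $images = [];
        foreach ($item->children($mediaNs)->group as $group) {
            foreach ($group->children($mediaNs)->content as $content) {
                // Copy every un-namespaced attribute (url, width, height, ...)
                $attrs = [];
                foreach ($content->attributes() as $name => $value) {
                    $attrs[$name] = (string) $value;
                }
                $images[] = $attrs;
            }
        }

        $items[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
            'media' => $images,
        ];
    }

    return $items;
}

// Made-up sample feed so the sketch runs without network access.
$xml = <<<'XML'
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title><![CDATA[Example item]]></title>
      <link>http://example.com/story.html</link>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://example.com/b.jpg" height="300" width="300" />
      </media:group>
    </item>
  </channel>
</rss>
XML;

$array = feedToArray(simplexml_load_string($xml, null, LIBXML_NOCDATA));
print_r($array);
```

Every value is cast to string explicitly, so the result contains no SimpleXMLElement objects and can be passed to print_r() or json_encode() safely.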



Answer 2:


I had a requirement to aggregate RSS news feeds from different sources, which had image tags in different formats, so I used the code below:

//Sample Feed 1: https://www.hindustantimes.com/rss/topnews/rssfeed.xml
//Sample Feed 2: https://economictimes.indiatimes.com/rssfeedsdefault.cms

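$feed = $_GET['feed']; // NOTE: validate this against a list of trusted feeds before loading it

$rss = simplexml_load_file($feed);
if ($rss === false) {
    exit('Could not load feed');
}
$namespaces = $rss->getNamespaces(true);

echo '<strong>'. $rss->channel->title . '</strong><br><br>';

foreach ($rss->channel->item as $item) {

    // Reset per item so a previous item's image is not reused
    $imageAlt = '';

    // Not every feed declares the media namespace
    if (isset($namespaces['media'])) {
        $media_content = $item->children($namespaces['media']);

        // If an item has several media tags, the last url wins
        foreach($media_content as $i){
            $imageAlt = (string)$i->attributes()->url;
        }
    }

    echo "Link: " . $item->link ."<br>";
    echo "Title: " . $item->title ."<br>";
    echo "Description: " . $item->description ."<br>";
    echo "PubDate: " . $item->pubDate ."<br>";
    echo "Image: " . $item->image ."<br>";
    echo "ImageAlt: " . $imageAlt ."<br>";
    echo "<br><br>";
}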


Source: https://stackoverflow.com/questions/41595926/how-to-parse-mediacontent-tag-in-rss-with-simplexml
