RSS feeds and image extraction in depth

Submitted by 删除回忆录丶 on 2019-12-25 02:25:44

Question


I have spent some time trying to solve this problem, and this is as far as I've got. Basically, I'm trying to pull images from RSS feeds. I use Magpie to process the feeds, as shown below. This snippet is inside a class:

function getImagesUrl($str) {
    $a = array();
    $pos = 0;

    // Scan for each occurrence of "img" and take the text up to the next ">"
    while (($pos = strpos($str, "img", $pos)) !== false) {
        $topos = strpos($str, ">", $pos);
        if ($topos === false) {
            break; // no closing ">", stop instead of looping forever
        }
        $imagetag = substr($str, $pos, $topos - $pos);
        $url = $this->getImageUrl($imagetag);
        array_push($a, $url);
        $pos = $topos;
    }
    return $a;
}


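/*
 * Get the full URL inside the src attribute of an <img> tag
 */
function getImageUrl($image) {
    $p = strpos($image, 'src="');
    if ($p === false) {
        return ''; // no double-quoted src attribute in this fragment
    }
    $p += 5; // skip past src="
    $tp = strpos($image, '"', $p); // closing quote of the attribute
    if ($tp === false) {
        return '';
    }
    return substr($image, $p, $tp - $p);
}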

Using the above functions, I call them this way. So far this produces the output pasted further down:

@$rss = fetch_rss($rsso->url);
if (@$rss) {
    $items = $rss->items;
    foreach ($items as $item) {
        if (isset($item['title']) && isset($item['description'])) {
            $hash = md5($this->es($item['title']) . $this->es($item['description']));
            $content = $item['content'];
            foreach ($content as $c) {
                // get the images in the content
                $arr = $this->getImagesUrl($c);
                print_r($arr);
            }
        }
    }
}

Here is an example of the output:

Array
(
    [0] => http://api.tweetmeme.com/imagebutton.gif?url=http://mashable.com/2010/09/25/trailmeme/
    [1] => http://cdn.mashable.com/wp-content/plugins/wp-digg-this/i/gbuzz-feed.png
    [2] => http://mashable.com/wp-content/plugins/wp-digg-this/i/fb.jpg
    [3] => http://mashable.com/wp-content/plugins/wp-digg-this/i/diggme.png
    [4] => http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif
    [5] => http://cdn.mashable.com/wp-content/uploads/2010/09/web.png
    [6] => http://mashable.com/wp-content/uploads/2010/09/Screen-shot-2010-09-24-at-10.51.26-PM.png
    [7] => http://cdn.mashable.com/wp-content/uploads/2009/02/bizspark.jpg
    [8] => http://feedads.g.doubleclick.net/~at/lxx00QTjYBaYojpnpnTa6MXUmh4/0/di
    [9] =>
    [10] => http://feedads.g.doubleclick.net/~at/lxx00QTjYBaYojpnpnTa6MXUmh4/1/di
    [11] =>
    [12] => http://feeds.feedburner.com/~ff/Mashable?i=0N_mvMwPHYk:j5Pmi_N-JQ8:D7DqB2pKExk
    [13] =>
    [14] => http://feeds.feedburner.com/~ff/Mashable?i=0N_mvMwPHYk:j5Pmi_N-JQ8:V_sGLiPBpWU
    [15] =>
    [16] => http://feeds.feedburner.com/~ff/Mashable?i=0N_mvMwPHYk:j5Pmi_N-JQ8:F7zBnMyn0Lo
    [17] =>
    [18] => http://feeds.feedburner.com/~ff/Mashable?d=qj6IDK7rITs
    [19] =>
    [20] => http://feeds.feedburner.com/~ff/Mashable?d=_e0tkf89iUM
    [21] =>
    [22] => http://feeds.feedburner.com/~ff/Mashable?i=0N_mvMwPHYk:j5Pmi_N-JQ8:gIN9vFwOqvQ
    [23] =>
    [24] => http://feeds.feedburner.com/~ff/Mashable?d=yIl2AUoC8zA
    [25] =>
    [26] => http://feeds.feedburner.com/~ff/Mashable?d=P0ZAIrC63Ok
    [27] =>
    [28] => http://feeds.feedburner.com/~ff/Mashable?d=I9og5sOYxJI
    [29] =>
    [30] => http://feeds.feedburner.com/~ff/Mashable?d=CC-BsrAYo0A
    [31] =>
    [32] => http://feeds.feedburner.com/~ff/Mashable?i=0N_mvMwPHYk:j5Pmi_N-JQ8:_cyp7NeR2Rw
    [33] =>
    [34] => http://feeds.feedburner.com/~r/Mashable/~4/0N_mvMwPHYk
)

Is there a way I can filter out the correct image URLs? For example, I would like to strip out URLs that do not end in an image extension such as jpg, png or gif. Secondly, I would like to scrap URLs containing words such as bizspark, digg, facebook, tweet or twitter. Has anybody found an easier way of doing this? Please help me out.
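The two filtering rules described above can be sketched roughly as follows. This is only an illustration under assumptions: `filterImageUrls()`, `$extensions` and `$blockedWords` are hypothetical names, not part of Magpie, and the extension is taken from the URL path only, so query strings are ignored.

```php
<?php
// Hypothetical helper (not part of Magpie): keep only URLs that end in an
// allowed image extension and contain none of the blocked words.
function filterImageUrls(array $urls, array $extensions, array $blockedWords): array
{
    $kept = array();
    foreach ($urls as $url) {
        // Look at the path only, so "?url=..." style query strings are ignored
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            continue; // malformed URL or no path component
        }
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!in_array($ext, $extensions, true)) {
            continue; // no recognised image extension
        }
        // Drop the URL if it contains any blocked word (case-insensitive)
        foreach ($blockedWords as $word) {
            if (stripos($url, $word) !== false) {
                continue 2;
            }
        }
        $kept[] = $url;
    }
    return $kept;
}
```

It could be called on the array returned by `getImagesUrl()`, for example `filterImageUrls($arr, array('jpg', 'jpeg', 'png', 'gif'), array('bizspark', 'digg', 'facebook', 'tweet', 'twitter'))`. A plain substring match is crude: a word like "tweet" also matches unrelated URLs that happen to contain it, so the blocked list needs tuning per feed.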


Answer 1:


I posted an answer to your related question here: Pulling Images from rss/atom feeds using magpie rss

To apply that answer to your code above, first make the changes to rss_parse.inc as per my previous answer. Then you can access the image URLs directly through Magpie (instead of having to write any extra functions), e.g.:

// Your code
@$rss = fetch_rss($rsso->url);
if (@$rss)
{
   $items=$rss->items;
   foreach ($items as $item ) 
   {
      if (isset($item['title'])&&isset($item['description']))
      {
         // START MY EDIT
         if (isset($item['enclosure_type']) && isset($item['enclosure_url'])){
            switch ($item['enclosure_type']){
               case "image/gif":
               case "image/jpeg":
               case "image/png":
                   $image_url=$item['enclosure_url'];
                   $image_length=$item['enclosure_length'];
                   break;
            }
         }
         //END MY EDIT
       }
   }
}

And that's it! You just have to use the $image_url variable to display your image (in an img tag, of course :-)).

I have only checked for jpg, gif and png images in the code above as they're the most popular, but you can add other MIME types to the switch if you need to. Just be aware that the enclosure type is set by the creator of the RSS feed and not read from the file, so it may not be accurate. You might want to use exif_imagetype() on the image file itself to ensure it actually is an image.
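A minimal sketch of that exif_imagetype() check is shown below. The helper name `isSupportedImage()` is hypothetical. It assumes the exif extension is loaded and, when given a remote URL rather than a local path, that allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: returns true only when the file's leading bytes
// identify it as a GIF, JPEG or PNG, regardless of the feed's declared type.
function isSupportedImage(string $file): bool
{
    if (!function_exists('exif_imagetype')) {
        return false; // exif extension not loaded, so the file stays unverified
    }
    // exif_imagetype() reads the first bytes of the file and returns an
    // IMAGETYPE_* constant, or false if no known signature is found.
    // The @ hides the notice/warning raised for unreadable or tiny files.
    $type = @exif_imagetype($file);
    return in_array($type, array(IMAGETYPE_GIF, IMAGETYPE_JPEG, IMAGETYPE_PNG), true);
}
```

In the loop above it could be used as `if (isSupportedImage($image_url)) { ... }`. Each call on a remote URL makes PHP fetch the start of the file, so for large feeds it may be worth caching the result.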

Hope this helps, if it's not too late!



Source: https://stackoverflow.com/questions/3793768/rss-feeds-and-image-extraction-indepth
