Scrape and generate RSS feed

别等时光非礼了梦想. 提交于 2019-12-04 15:12:37

Well for just a simple combination of the two loops you could create the feed as your parse through the HTML:

<?php
include("FeedWriter.php");
include('simple_html_dom.php');

$html = file_get_html('http://www.website.com');

//Creating an instance of FeedWriter class. 
$TestFeed = new FeedWriter(RSS2);
$TestFeed->setTitle('Testing & Checking the RSS writer class');
$TestFeed->setLink('http://www.ajaxray.com/projects/rss');
$TestFeed->setDescription(
  'This is test of creating a RSS 2.0 feed Universal Feed Writer');

$TestFeed->setImage('Testing the RSS writer class',
                    'http://www.ajaxray.com/projects/rss',
                    'http://www.rightbrainsolution.com/images/logo.gif');

//parse through the HTML and build up the RSS feed as we go along
foreach($html->find('td[width="380"] p table') as $article) {
  //Create an empty FeedItem
  $newItem = $TestFeed->createNewItem();

  //Look up and add elements to the feed item   
  $newItem->setTitle($article->find('span.title', 0)->innertext);
  $newItem->setDescription($article->find('.ingress', 0)->innertext);
  $newItem->setLink($article->find('.lesMer', 0)->href);     
  $newItem->setDate($article->find('span.presseDato', 0)->plaintext);     

  //Now add the feed item
  $TestFeed->addItem($newItem);
}

$TestFeed->genarateFeed();
?>

What's the issue you're seeing with html_entity_decode, if you give us a link to a page it doesn't work on that might help?

It seems that you loop through the $html to build an array of articles, then loop through these adding to a feed - you can skip a whole loop here by adding items to the feed as they're found. To do this you'll need to move you FeedWriter contstructor up a bit in the execution flow.

I'd also add a couple of methods in to help with readability, which may help maintainability in the long run. Encapsulating your feed creation, item modification etc should make it easier if you ever need to plug a different provider class in for the feed, change parsing rules, etc. There are further improvements that can be made on the below code (html_entity_decode is on a separate line from $item['title'] assignment etc) but you get the general idea.

What is the issue you're having with html_entity_decode? Have you a sample input/output?

<?php

 // This is a minimum example of using the class
 include("FeedWriter.php");
 include('simple_html_dom.php');

 // Create new instance of a feed
 $TestFeed = create_new_feed();

 $html = file_get_html('http://www.website.com');

 // Loop through html pulling feed items out
 foreach($html->find('td[width="380"] p table') as $article) 
 {
    // Get a parsed item
    $item = get_item_from_article($article);

    // Get the item formatted for feed
    $formatted_item = create_feed_item($TestFeed, $item);

    //Now add the feed item
    $TestFeed->addItem($formatted_item);
 }

 //OK. Everything is done. Now generate the feed.
 $TestFeed->generateFeed();


// HELPER FUNCTIONS

/**
 * Create new feed - encapsulated in method here to allow
 * for change in feed class etc
 */
function create_new_feed()
{
     //Creating an instance of FeedWriter class. 
     $TestFeed = new FeedWriter(RSS2);

     //Use wrapper functions for common channel elements
     $TestFeed->setTitle('Testing & Checking the RSS writer class');
     $TestFeed->setLink('http://www.ajaxray.com/projects/rss');
     $TestFeed->setDescription('This is test of creating a RSS 2.0 feed Universal Feed Writer');

     //Image title and link must match with the 'title' and 'link' channel elements for valid RSS 2.0
     $TestFeed->setImage('Testing the RSS writer class','http://www.ajaxray.com/projects/rss','http://www.rightbrainsolution.com/images/logo.gif');

     return $TestFeed;
}


/**
 * Take in html article segment, and convert to usable $item
 */
function get_item_from_article($article)
{
    $item['title'] = $article->find('span.title', 0)->innertext;
    $item['title'] = html_entity_decode($item['title'], ENT_NOQUOTES, 'UTF-8');

    $item['description'] = $article->find('.ingress', 0)->innertext;
    $item['link'] = $article->find('.lesMer', 0)->href;     
    $item['pubDate'] = $article->find('span.presseDato', 0)->plaintext;     

    return $item;
}


/**
 * Given an $item with feed data, create a
 * feed item
 */
function create_feed_item($TestFeed, $item)
{
    //Create an empty FeedItem
    $newItem = $TestFeed->createNewItem();

    //Add elements to the feed item    
    $newItem->setTitle($item['title']);
    $newItem->setLink($item['link']);
    $newItem->setDate($item['pubDate']);
    $newItem->setDescription($item['description']);

    return $newItem;
}
?>

How can I make this code simpler?

I know it's not exactly what you're asking, but do you know about [http://pipes.yahoo.com/pipes/](Yahoo! pipes)?

Maybe you can just use something like Feedity - http://feedity.com which already solves the problem to generate RSS feed from any webpage.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!