How to programmatically determine whether an RSS feed is a full feed or a partial feed

旧街凉风 submitted on 2019-12-03 06:17:00

Look for a link at the end of the item that says "More", "Continued", "Full article", "..." or similar. The alternative is to follow every link in the item and check whether the target page contains the text from the feed plus extra content.

I don't think there is a very clean way of doing this, but here are two "hacky" ones:

I'd parse the RSS item's text and look for any links in it. Granted, there could be multiple links there (some to other blog posts), but if you focus on the last one and come up with a few heuristic words for the link text (e.g. "more", "read full", etc.), you should be able to catch a lot of them. For more confidence, you can look only at the links that point back to the original blog.
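
A minimal sketch of that last-link heuristic, using only the Python standard library. The phrase list, the `looks_partial` name, and the `blog_host` parameter are assumptions made for illustration; real feeds will need their own tuning.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical phrase list; extend it for the feeds you care about.
MORE_PHRASES = ("more", "continue", "read full", "full article", "full story", "...", "\u2026")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # set while we are inside an <a> element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def looks_partial(description_html, blog_host=None):
    """Guess whether a feed item is truncated, based on its last link.

    If blog_host is given, only links pointing at that host are considered.
    """
    parser = LinkCollector()
    parser.feed(description_html)
    parser.close()
    links = parser.links
    if blog_host is not None:
        links = [(href, text) for href, text in links
                 if urlparse(href).netloc == blog_host]
    if not links:
        return False
    last_text = links[-1][1].lower()
    return any(phrase in last_text for phrase in MORE_PHRASES)
```

Substring matching is deliberately loose here, so a link titled "Learn more about X" also counts; restricting to `blog_host` is one way to cut down on such false positives.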

A more rigorous method would be to follow all the links and check whether the RSS fragment is a subset of the page that comes back, or whether there is substantial overlap. This may not help when the site uses a true summary rather than a fragment of the full post, though.
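
One way to approximate the subset check is to measure how many of the feed item's word triples also occur on the linked page. The sketch below assumes both HTML strings have already been downloaded; the `containment` name and the choice of three-word shingles are illustrative, not something the answer prescribes.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def words(html):
    """Return the lowercased words of an HTML fragment."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"\w+", " ".join(extractor.parts).lower())


def shingles(tokens, n=3):
    """Return the set of n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(feed_html, page_html, n=3):
    """Fraction of the feed item's shingles that also appear on the page."""
    feed = shingles(words(feed_html), n)
    if not feed:
        return 0.0
    page = shingles(words(page_html), n)
    return len(feed & page) / len(feed)
```

A score near 1.0 suggests the feed text was copied from the page (a fragment or the full post), while a low score is what a hand-written summary would tend to produce, which is the case this method cannot resolve.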

Why not follow the URL from the RSS feed and check whether there is more text on that page than in the feed item? You would need to use an HTML parser and put in some general rules.
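
A rough sketch of that length comparison follows. The "general rule" used here, counting only text inside `<p>` elements on the page so that menus and footers are ignored, and the 1.5 threshold are both assumptions; `fetch`, `word_count`, and `is_partial` are hypothetical helper names.

```python
import re
import urllib.request
from html.parser import HTMLParser


class TextCounter(HTMLParser):
    """Count words, optionally only those inside <p> elements."""

    def __init__(self, paragraphs_only):
        super().__init__()
        self.paragraphs_only = paragraphs_only
        self.in_p = False
        self.skip = False  # True while inside <script> or <style>
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
        elif tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if self.skip:
            return
        if self.in_p or not self.paragraphs_only:
            self.count += len(re.findall(r"\w+", data))


def word_count(html, paragraphs_only=False):
    counter = TextCounter(paragraphs_only)
    counter.feed(html)
    counter.close()
    return counter.count


def is_partial(feed_html, page_html, ratio=1.5):
    """True if the page's paragraph text is much longer than the feed item."""
    feed_words = word_count(feed_html)
    page_words = word_count(page_html, paragraphs_only=True)
    return page_words > ratio * max(feed_words, 1)


def fetch(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

In use, this would be called as `is_partial(item_description, fetch(item_link))` for each feed item. Pages that put comments or related-post teasers inside `<p>` tags will inflate the count, so the threshold is something to calibrate per site.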
