Web crawler links/page logic in PHP

Submitted by 泄露秘密 on 2019-12-11 05:47:28

Question


I'm writing a basic crawler that simply caches pages with PHP.

All it does is use file_get_contents to fetch the contents of a web page and a regex to extract all the links of the form <a href="URL">DESCRIPTION</a> - at the moment it returns:

Array
(
    [url] => URL
    [desc] => DESCRIPTION
)
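
For reference, a minimal sketch of the kind of extraction described above (the extract_links helper and the exact pattern are illustrative assumptions; as Answer 1 below notes, a real HTML parser is more robust than a regex):

<?php
// Naive link extraction: fetch a page and pull out simple <a href="...">...</a> tags.
function extract_links(string $url): array
{
    $html = file_get_contents($url);
    $links = [];

    if ($html === false) {
        return $links; // fetch failed
    }

    // Naive pattern: only matches <a> tags whose first attribute is href="...".
    if (preg_match_all('/<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = ['url' => $m[1], 'desc' => $m[2]];
        }
    }

    return $links;
}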

The problem I'm having is figuring out the logic for determining whether a link is local, or working out whether it points to a completely different local directory.

It could be any number of combinations, e.g. href="../folder/folder2/blah/page.html", href="google.com", or href="page.html" - the possibilities are endless.

What would be the correct algorithm to approach this? I don't want to lose any data that could be important.


Answer 1:


First of all, regex and HTML don't mix. Use a DOM parser instead:

$doc = new DOMDocument();
@$doc->loadHTML($source); // @ suppresses warnings on malformed real-world HTML

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
}

Links that may point outside your site start with a scheme or with //, e.g.

http://example.com
//example.com/

href="google.com" is link to a local file.

But if you want to create a static copy of a site, why not just use wget?
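
For example, with wget's standard mirroring flags (example.com is a placeholder):

wget --mirror --convert-links --page-requisites http://example.com/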




Answer 2:


Let's first consider the properties of local links.

These will either be:

  • relative with no scheme and no host, or
  • absolute with a scheme of 'http' or 'https' and a host that matches the machine from which the script is running

That's all the logic you need to decide whether a link is local.

Use the parse_url function to split a URL into its components so you can inspect the scheme and host.
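
A sketch of that check, assuming $ourHost holds the host name the script is crawling (the is_local_link helper is illustrative):

<?php
// Local if the link is relative (no scheme, no host), or absolute with an
// http/https scheme and a host matching our own machine.
function is_local_link(string $href, string $ourHost): bool
{
    $parts = parse_url($href);

    if ($parts === false) {
        return false; // seriously malformed URL
    }

    // Relative link: no scheme and no host.
    if (!isset($parts['scheme']) && !isset($parts['host'])) {
        return true;
    }

    // Absolute link: http(s) pointing back at our own host.
    return isset($parts['scheme'], $parts['host'])
        && in_array(strtolower($parts['scheme']), ['http', 'https'], true)
        && strcasecmp($parts['host'], $ourHost) === 0;
}

var_dump(is_local_link('../folder/page.html', 'example.com'));  // true
var_dump(is_local_link('http://example.com/a', 'example.com')); // true
var_dump(is_local_link('http://google.com/', 'example.com'));   // false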




Answer 3:


You would have to look for http:// in the href. Otherwise, you could check whether it starts with ./ or some combination of ../. If you don't find a / at all, you would have to assume it's a file. Would you like a script for this?



Source: https://stackoverflow.com/questions/361285/web-crawler-links-page-logic-in-php
