How to get a list of links in a webpage in PHP? [duplicate]

六月ゝ 毕业季﹏ submitted on 2020-01-24 14:38:08

Question


Possible Duplicate:
Parse Website for URLs

How do I get all the links in a webpage using PHP?

I need to get a list of the links, for example:

Google

I want to fetch the href (http://www.google.com) and the text (Google)

The situation is:

I'm building a crawler and I want it to get all the links that exist in a database table.


Answer 1:


There are a couple of ways to do this, but I would approach it something like the following.

Use cURL to fetch the page, i.e.:

// $target_url holds the URL to be fetched, e.g. "http://www.website.com"
// $userAgent should be set to a friendly agent, sneaky but hey...
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Create the handle first; options can only be set on an existing handle.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);     // treat HTTP status >= 400 as a failure
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);     // set the Referer header when redirected
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // give up after 10 seconds
$html = curl_exec($ch);
if (!$html) {
    // curl_exec() returns false on failure; this check also stops on an empty body
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}

If all goes well, the page content is now in $html.

Let's move on and load the page into a DOM object:

$dom = new DOMDocument();
// @ suppresses the warnings loadHTML() emits for malformed real-world markup
@$dom->loadHTML($html);
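
The @ operator hides every warning from that call, including ones you might care about. One alternative is libxml's internal error buffer, which collects the parser messages so you can inspect or discard them. This is a minimal sketch, assuming a hard-coded sample string in place of the fetched page; the sample markup and the variable names $previous, $loaded and $problems are made up for illustration:

```php
<?php
// Hypothetical sample markup with an unclosed <p>, standing in for the fetched page.
$html = '<html><body><a href="http://www.google.com">Google</a><p>unclosed</body></html>';

// Buffer libxml parser messages instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Collect whatever the parser complained about, then empty the buffer.
$problems = libxml_get_errors();
libxml_clear_errors();

// Restore the previous setting so other code is not affected.
libxml_use_internal_errors($previous);

echo $loaded ? "parsed\n" : "failed\n";
echo count($problems) . " parser message(s) collected\n";
```

How many messages you get depends on the markup and on the libxml version, so treat the count as diagnostic output rather than something to rely on.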

So far so good. Now XPath comes to the rescue to scrape the links out of the DOM object:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

Loop through the result and get the links:

$links = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;

    // Do what you want with the link, e.g. print it out:
    echo $text, ' -> ', $link;

    // Or save it in an array for later processing:
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}

$hrefs is an object of type DOMNodeList, and item() returns a DOMNode object for the specified index (here each one is a DOMElement, which is what provides getAttribute()). So basically we've got a loop that retrieves each link as a DOMNode object.
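
Since DOMNodeList can also be walked with foreach, the parsing and looping steps can be folded into one small function. This is only a sketch under a few assumptions: the function name extractLinks() is hypothetical, it takes the HTML as a string (so the cURL step stays separate), and it uses libxml_use_internal_errors() rather than @ to keep parser warnings quiet.

```php
<?php
/**
 * Hypothetical helper: returns every <a> found under <body> in $html
 * as an array of ['href' => ..., 'text' => ...] entries.
 */
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];  // nothing to parse
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // buffer parser warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Same XPath expression as in the answer; foreach replaces the index loop.
    foreach ($xpath->query('/html/body//a') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Usage with the example from the question.
$sample = '<html><body><p><a href="http://www.google.com">Google</a></p></body></html>';
foreach (extractLinks($sample) as $link) {
    echo $link['text'], ' -> ', $link['href'], "\n";
}
```

Returning a plain array keeps the result easy to store, for instance in the database table the crawler works from.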

This should pretty much do it for you. The only part I am not 100% sure of is what happens when the link is an image or an anchor. I have not tried those cases, so you would need to test and filter them out.
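
One way to test those cases is to run the same XPath loop over a small page that contains them. The sketch below is an assumption about what "filtering" could mean here, not a tested recipe from the answer: it skips <a> elements without an href and links that only point to a fragment (#...), and for links that wrap an image it falls back to the image's alt text. The sample markup and the 'image' key are made up for illustration.

```php
<?php
// Hypothetical sample page covering a text link, an image link,
// a named anchor without href, and a fragment-only link.
$html = '<html><body>'
      . '<a href="http://www.google.com">Google</a>'
      . '<a href="/logo"><img src="logo.png" alt="Logo"></a>'
      . '<a name="top">Top of page</a>'
      . '<a href="#top">Back to top</a>'
      . '</body></html>';

$dom = new DOMDocument();
$previous = libxml_use_internal_errors(true);  // buffer parser warnings
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('/html/body//a') as $a) {
    $href = $a->getAttribute('href');  // empty string when the attribute is missing

    // Skip named anchors (no href) and in-page fragment links.
    if ($href === '' || $href[0] === '#') {
        continue;
    }

    $text = trim($a->textContent);
    $img = $xpath->query('.//img', $a)->item(0);  // first <img> inside the link, if any
    $isImage = $img !== null;

    // An image-only link has no text of its own; use the alt text instead.
    if ($text === '' && $isImage) {
        $text = $img->getAttribute('alt');
    }

    $links[] = ['href' => $href, 'text' => $text, 'image' => $isImage];
}

print_r($links);
```

Whether fragment links should be dropped depends on the crawler; here they are skipped because they lead back to the same page.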

Hope this gives you an idea of how to scrape links, happy coding.



Source: https://stackoverflow.com/questions/6314936/how-to-get-a-list-of-links-in-a-webpage-in-php
