Using cURL to get all links in a website (not only the page)

Submitted by 大城市里の小女人 on 2019-12-07 11:39:17

Question


I use the following PHP script to get all the links on a given page, but I'm trying to get all the links on a website as a whole.

<?php

    function urlLooper($url){

        $urlArray = array();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $result = curl_exec($ch);

        $regex='|<a.*?href="(.*?)"|';
        preg_match_all($regex,$result,$parts);
        $links=$parts[1];
        foreach($links as $link){
            array_push($urlArray, $link);
        }
        curl_close($ch);

        foreach($urlArray as $value){
            echo $value . '<br />';
        }
    }

    $url = 'http://www.justfundraising.com/';
    urlLooper($url);

?>

Is there any way to use cURL (or any other method frankly) to get all the links on a website? I have access to the server in case you're wondering.

My idea was to gather all the links from, say, the homepage, then pass those links back through the same function to get a new list of links, ignoring any duplicates. I figure that way I'll eventually reach every page.
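Roughly, I imagine something like the sketch below. It is untested and only an outline of the idea: crawlSite(), fetchPage() and resolveLink() are just placeholder names, it limits itself to the same host, and relative links are handled very naively.

<?php
// Rough, untested sketch of the idea above: crawl breadth-first with cURL,
// skipping anything already visited. Helper names are placeholders.

function fetchPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

function resolveLink($href, $base) {
    if (preg_match('#^https?://#i', $href)) {
        return $href;                                  // already absolute
    }
    if ($href === '' || $href[0] === '#' || stripos($href, 'mailto:') === 0) {
        return null;                                   // skip anchors and mailto links
    }
    // Naive: treat every relative link as relative to the site root.
    return parse_url($base, PHP_URL_SCHEME) . '://' . parse_url($base, PHP_URL_HOST) . '/' . ltrim($href, '/');
}

function crawlSite($startUrl, $maxPages = 50) {
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = array($startUrl);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;                                  // already crawled, skip duplicates
        }
        $visited[$url] = true;

        preg_match_all('|<a.*?href="(.*?)"|', fetchPage($url), $parts);
        foreach ($parts[1] as $link) {
            $abs = resolveLink($link, $url);
            if ($abs !== null && parse_url($abs, PHP_URL_HOST) === $host) {
                $queue[] = $abs;                       // only follow links on the same site
            }
        }
    }
    return array_keys($visited);
}

foreach (crawlSite('http://www.justfundraising.com/') as $page) {
    echo $page . '<br />';
}
?>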

Any help will be appreciated!


Answer 1:


As @mario mentions above, perhaps look into using phpQuery (http://code.google.com/p/phpquery/). Once you have downloaded the library and included it in your page, the example code below shows how to get an array containing all the links from the string you pass in (I have simply hardcoded a string in the newDocument() call as an example):

$links = phpQuery::newDocument('<a href="test1.html">Test 1</a><a href="test2.html">Test 2</a><a href="test3.html">Test 3</a>')->find('a');
$array_links = array();
foreach($links as $r) {
    $array_links[] = pq($r)->attr('href');
}
die("<pre>".print_r($array_links,true)."</pre>");

The above code will return:

Array
(
    [0] => test1.html
    [1] => test2.html
    [2] => test3.html
)
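
To tie this back to the question, the HTML fetched with cURL can be fed straight into phpQuery::newDocument() in place of the hardcoded string. A minimal sketch, assuming phpQuery has already been included as described above:

$ch = curl_init('http://www.justfundraising.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$array_links = array();
foreach (phpQuery::newDocument($html)->find('a') as $r) {
    $array_links[] = pq($r)->attr('href');   // collect every href on the fetched page
}
print_r($array_links);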

Hope this helps.




Answer 2:


curl only fetches what you tell it to. It won't parse the contents for you, and it won't recursively fetch "external" resources referred to by the content. You'll have to rummage around in the returned HTML yourself, parse out image/script links, and use more curl calls to fetch those.

In other words, you'll have to replicate wget, which boils down to: just use wget.
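
If you want to stay inside PHP while still letting wget do the crawling, one rough option is to shell out to wget's spider mode and pull the URLs back out of its log. This is only a sketch: the flags are standard, but the log format varies between wget versions, so the regex simply grabs anything that looks like a URL.

// Let wget do the recursive traversal (--spider means nothing is saved to disk),
// then scrape URLs out of whatever it logged. Depth and start URL are examples.
$log = (string) shell_exec('wget --spider --recursive --level=2 --no-verbose http://www.justfundraising.com/ 2>&1');
preg_match_all('#https?://[^\s"\'<>]+#', $log, $matches);
foreach (array_unique($matches[0]) as $link) {
    echo $link . '<br />';
}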




Answer 3:


I was trying the same thing using simplehtmldom, but the code crashed after some time. I was actually using a DFS (depth-first search) approach here, which can overflow the stack at some point.

You can try the same method using cURL instead.

Here is my code:

<?php
// Requires the Simple HTML DOM parser, e.g.:
// include 'simple_html_dom.php';

$home = 'http://www.justfundraising.com/';   // start page; use your own site's homepage
traverse($home, 0);

function traverse($url, $depth)
{
    if ($depth > 1) return;              // depth limit keeps the DFS from running away
    $html = file_get_html($url);
    foreach ($html->find('a') as $element) {
        $nurl = $element->href;
        echo $nurl . "<br>";
        traverse($nurl, $depth + 1);     // recurse into each discovered link
    }
}
?>


Source: https://stackoverflow.com/questions/7031058/using-curl-to-get-all-links-in-a-website-not-only-the-page
