Parse Website for URLs

执念已碎 2020-12-07 02:49

Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&so

3 Answers
  • 2020-12-07 03:11

    You really shouldn’t use regular expressions to parse HTML, as it’s too error-prone.

    It is better to use an HTML parser, such as the one in PHP’s DOM library:

    $code = file_get_contents($url);
    $doc = new DOMDocument();
    $doc->loadHTML($code);
    // Collect the href value of every <a> element that has one.
    $links = array();
    foreach ($doc->getElementsByTagName('a') as $element) {
        if ($element->hasAttribute('href')) {
            $links[] = $element->getAttribute('href');
        }
    }
    

    Note that this collects the URI references exactly as they appear in the document, not as absolute URIs. You might want to resolve them against the page's URL before using them.

    It seems that PHP doesn’t provide an appropriate library for resolving URI references (or I haven’t found it yet). But see RFC 3986 – Reference Resolution and my answer on Convert a relative URL to an absolute URL with Simple HTML DOM? for further details.
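
    A minimal sketch of such a resolver, loosely following RFC 3986 section 5.2, might look like the code below. The function names resolveUrl and removeDotSegments are made up for this illustration. It ignores userinfo in the base URL and does not handle every edge case, so treat it as a starting point rather than a complete implementation.

    ```php
    <?php
    // Hypothetical helper: drop "." and ".." segments from an absolute path
    // (a simplified take on RFC 3986, section 5.2.4).
    function removeDotSegments(string $path): string
    {
        $segments = explode('/', $path);
        $last = end($segments);
        $output = [];
        foreach ($segments as $segment) {
            if ($segment === '.') {
                continue;
            }
            if ($segment === '..') {
                // Never pop the leading empty segment that stands for the root.
                if (count($output) > 1) {
                    array_pop($output);
                }
                continue;
            }
            $output[] = $segment;
        }
        // A trailing "." or ".." means the result names a directory.
        if ($last === '.' || $last === '..') {
            $output[] = '';
        }
        return implode('/', $output);
    }

    // Hypothetical helper: resolve $ref against the absolute URL $base.
    function resolveUrl(string $base, string $ref): string
    {
        $b = parse_url($base);
        $r = parse_url($ref);
        if ($b === false || $r === false || !isset($b['scheme'], $b['host'])) {
            return $ref; // cannot resolve; hand the reference back unchanged
        }
        if (isset($r['scheme'])) {
            return $ref; // already absolute (http:, mailto:, ...)
        }
        $authority = $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
        $basePath = $b['path'] ?? '/';
        $refPath = $r['path'] ?? '';

        if (isset($r['host'])) {
            // Network-path reference such as //cdn.example.org/lib.js
            $authority = $r['host'] . (isset($r['port']) ? ':' . $r['port'] : '');
            $path = removeDotSegments($refPath);
            $query = $r['query'] ?? null;
        } elseif ($refPath === '') {
            // Query-only, fragment-only or empty reference: keep the base path.
            $path = $basePath;
            $query = $r['query'] ?? ($b['query'] ?? null);
        } else {
            if ($refPath[0] !== '/') {
                // Relative path: merge it with the directory of the base path.
                $refPath = substr($basePath, 0, strrpos($basePath, '/') + 1) . $refPath;
            }
            $path = removeDotSegments($refPath);
            $query = $r['query'] ?? null;
        }

        $url = $b['scheme'] . '://' . $authority . $path;
        if ($query !== null) {
            $url .= '?' . $query;
        }
        if (isset($r['fragment'])) {
            $url .= '#' . $r['fragment'];
        }
        return $url;
    }

    $base = 'http://www.example.com/dir/page.html?pg=1';
    echo resolveUrl($base, '../img/logo.png'), PHP_EOL; // http://www.example.com/img/logo.png
    echo resolveUrl($base, '?pg=2'), PHP_EOL;           // http://www.example.com/dir/page.html?pg=2
    ```

    Each entry of $links could be passed through resolveUrl together with the URL the page was fetched from.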

  • 2020-12-07 03:17

    Try this method:

    function getinboundLinks($domain_name) {
        // Identify the crawler to the remote server.
        ini_set('user_agent', 'NameOfAgent (http://localhost)');
        $url = $domain_name;
        // Reduce the URL to a bare host name: no scheme, no "www.", no path.
        $url_without_www = str_replace('http://', '', $url);
        $url_without_www = str_replace('www.', '', $url_without_www);
        $path = strstr($url_without_www, '/');
        if ($path !== false) {
            $url_without_www = str_replace($path, '', $url_without_www);
        }
        $url_without_www = strtolower(trim($url_without_www));
        $input = @file_get_contents($url) or die("Could not access file: $url");
        $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
        $inbound = 0;
        $outbound = 0;
        $nonfollow = 0;
        if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                // $match[2] = link address, $match[3] = link text
                if (!empty($match[2]) && !empty($match[3])) {
                    $href = strtolower($match[2]);
                    if (strpos($href, 'url:') !== false) {
                        $nonfollow += 1;
                    } else if (strpos($href, $url_without_www) !== false || strpos($href, 'http://') === false) {
                        // Same host, or a relative link.
                        $inbound += 1;
                        echo '<br>inbound ' . $match[2];
                    } else {
                        // Absolute link to a different host.
                        echo '<br>outbound ' . $match[2];
                        $outbound += 1;
                    }
                }
            }
        }
        $links = array();
        $links['inbound'] = $inbound;
        $links['outbound'] = $outbound;
        $links['nonfollow'] = $nonfollow;
        return $links;
    }
    
    // ************************Usage********************************
    $Domain='http://zachbrowne.com';
    $links=getinboundLinks($Domain);
    echo '<br>Number of inbound Links '.$links['inbound'];
    echo '<br>Number of outbound Links '.$links['outbound'];
    echo '<br>Number of Nonfollow Links '.$links['nonfollow'];
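
    For comparison, the same kind of counting can be sketched with PHP's DOM extension instead of a regular expression. This is only an illustration under a few assumptions: the function name classifyLinks is invented here, "nofollow" is taken to mean a rel="nofollow" attribute (the function above tests for 'url:' in the address instead), hosts are compared exactly apart from a leading "www.", and links without a host (relative links, but also mailto: and similar) are counted as inbound.

    ```php
    <?php
    // Hypothetical helper: count inbound, outbound and nofollow links in $html
    // for a site whose host name is $siteHost.
    function classifyLinks(string $html, string $siteHost): array
    {
        $counts = ['inbound' => 0, 'outbound' => 0, 'nofollow' => 0];
        if ($html === '') {
            return $counts; // DOMDocument::loadHTML() rejects an empty string
        }
        $siteHost = preg_replace('/^www\./', '', strtolower($siteHost));

        // Collect parser warnings internally instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($doc->getElementsByTagName('a') as $a) {
            if (!$a->hasAttribute('href')) {
                continue;
            }
            // rel may hold several space-separated tokens.
            $rel = preg_split('/\s+/', strtolower($a->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
            if (in_array('nofollow', $rel, true)) {
                $counts['nofollow']++;
                continue;
            }
            $host = parse_url($a->getAttribute('href'), PHP_URL_HOST);
            if ($host === null || $host === false) {
                $counts['inbound']++; // no host component, e.g. a relative link
                continue;
            }
            $host = preg_replace('/^www\./', '', strtolower($host));
            if ($host === $siteHost) {
                $counts['inbound']++;
            } else {
                $counts['outbound']++;
            }
        }
        return $counts;
    }

    // Made-up sample markup for the illustration.
    $sample = '<p><a href="/about">About</a> '
            . '<a href="http://www.example.com/x">Home</a> '
            . '<a href="http://other.org/">Other</a> '
            . '<a rel="nofollow" href="http://ads.example.net/">Ad</a></p>';
    print_r(classifyLinks($sample, 'example.com'));
    ```

    Taking the HTML as a string keeps the fetching step (file_get_contents or similar) separate from the counting, which also makes the function easy to try on small samples.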
    
  • 2020-12-07 03:20

    Use an HTML DOM parser such as Simple HTML DOM, which provides file_get_html:

    $html = file_get_html('http://www.example.com/');
    
    // Find all links
    $links = array(); 
    foreach($html->find('a') as $element) 
           $links[] = $element->href;
    

    Now the $links array contains all URLs of the given page, and you can use them for further parsing.

    Parsing HTML with regular expressions is not a good idea. Here are some related posts:

    • Using regular expressions to parse HTML: why not?
    • RegEx match open tags except XHTML self-contained tags
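
    A small experiment can illustrate the point. The sketch below runs the link-matching pattern from the answer above and PHP's built-in DOMDocument over the same made-up markup, which contains one live link with a single-quoted href and one link inside an HTML comment. The variable names and the sample markup are invented for this example.

    ```php
    <?php
    // Sample markup: one live link (single-quoted href) and one commented-out link.
    $html = "<html><body>"
          . "<a href='x.html'>X</a>"
          . "<!-- <a href=\"old.html\">old</a> -->"
          . "</body></html>";

    // Regex approach, using the pattern from the answer above.
    $viaRegex = [];
    $pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $viaRegex[] = $match[2];
        }
    }

    // DOM approach: only real <a> elements are visited.
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $viaDom = [];
    foreach ($doc->getElementsByTagName('a') as $element) {
        if ($element->hasAttribute('href')) {
            $viaDom[] = $element->getAttribute('href');
        }
    }

    print_r($viaRegex); // 'x.html' (with the quote characters) and old.html
    print_r($viaDom);   // only x.html
    ```

    The regex keeps the single quotes as part of the address and also reports the commented-out link, while the parser returns just the one link that a browser would show.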

    EDIT:

    Some other HTML parsing tools, as suggested by Gordon in the comments:

    • phpQuery
    • Zend_Dom
    • QueryPath
    • FluentDom