How to extract links and titles from a .html page?

≯℡__Kan透↙ submitted on 2019-11-26 19:51:10
Toni Michel Caubet

Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that loadHTML()
//emits when the markup in $html is malformed.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their text and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}

This prints the anchor text and the href of every link in the .html file.

Again, thanks a lot.
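
As a possible alternative to the @ operator in the answer above, libxml can be told to collect parse warnings internally. This is only a minimal sketch: the function name extractLinks() and the inline sample markup are assumptions made for illustration, not part of the original code.

```php
<?php
// Sketch: parse possibly malformed HTML without the @ operator.
// extractLinks() is a hypothetical helper name, not part of the original answer.
function extractLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $links[] = [
            'text' => trim($link->nodeValue),
            'href' => $link->getAttribute('href'),
        ];
    }
    return $links;
}

// Inline sample markup (an assumption) so the sketch runs without bookmarks.html
$sample = '<p><a href="http://example.com/a">First</a> <a href="/b">Second</a></p>';
foreach (extractLinks($sample) as $link) {
    echo $link['text'], ': ', $link['href'], "\n";
}
```

To use it on the real file, pass file_get_contents('bookmarks.html') instead of the sample string.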

This is probably sufficient:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}
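
If anchors without an href (for example named anchors) should be skipped, an XPath query is one way to filter them. The following is a sketch under that assumption; linksWithHref() is a hypothetical helper name and the sample string is made up for the example.

```php
<?php
// Sketch: use DOMXPath to select only <a> elements that carry an href.
// linksWithHref() is a hypothetical helper name.
function linksWithHref(string $html): array
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // @ hides warnings about malformed markup

    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        // Key by href, so a repeated URL keeps its last anchor text
        $result[$node->getAttribute('href')] = trim($node->nodeValue);
    }
    return $result;
}

// Sample markup (an assumption): the first anchor has no href and is skipped
$sample = '<a name="top">Top</a> <a href="http://example.com/">Example</a>';
print_r(linksWithHref($sample));
```

Keying the result by href removes duplicate URLs as a side effect; use a plain list instead if every occurrence matters.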

Assuming the stored links are in an HTML file, the best solution is probably an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it myself). The other option is a basic string search or a regular expression, but you should probably never use regular expressions to parse HTML.

After reading the HTML file with the parser, use its functions to find the a tags.

From the tutorial:

$html = file_get_html('bookmarks.html'); // provided by simple_html_dom.php

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

The following is an example with an inline string. In your case, load the file instead:

$content = file_get_contents('bookmarks.html');

Example to run:

<?php

$content = '<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>';

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array
(
    [0] => http://clicklink.com
    [1] => http://foobar.com
)

http://clicklink.com

http://foobar.com

Raghavendra

$html = file_get_contents('your file path');

$dom = new DOMDocument;

@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');

$links = $dom->getElementsByTagName('a');

$scripts = $dom->getElementsByTagName('script');

foreach ($styles as $style) {
    if ($style->getAttribute('href') != "#") {
        echo $style->getAttribute('href');
        echo '<br>';
    }
}

foreach ($links as $link) {
    if ($link->getAttribute('href') != "#") {
        echo $link->getAttribute('href');
        echo '<br>';
    }
}

foreach ($scripts as $script) {
    echo $script->getAttribute('src');
    echo '<br>';
}
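
The three loops above follow the same pattern, so they could be folded into one function driven by a tag-to-attribute map. This is a sketch only: collectUrls(), its $map parameter and the sample markup are hypothetical. Unlike the code above, it also skips elements where the attribute is missing (such as inline scripts).

```php
<?php
// Sketch: collect URL-bearing attributes for several tags in one pass.
// collectUrls() and its $map parameter are hypothetical names.
function collectUrls(string $html, array $map): array
{
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // @ hides warnings about malformed markup

    $urls = [];
    foreach ($map as $tag => $attribute) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attribute);
            // Skip missing attributes and "#" placeholders
            if ($url !== '' && $url !== '#') {
                $urls[$tag][] = $url;
            }
        }
    }
    return $urls;
}

// Sample markup (an assumption) with a stylesheet, two scripts and two anchors
$sample = '<html><head><link href="style.css" rel="stylesheet">'
        . '<script src="app.js"></script><script>var x = 1;</script></head>'
        . '<body><a href="#">Top</a><a href="page.html">Page</a></body></html>';

print_r(collectUrls($sample, ['link' => 'href', 'a' => 'href', 'script' => 'src']));
```

Adding another tag, for example 'img' => 'src', then only needs one more entry in the map.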

I wanted to create a CSV of link paths and their text from HTML pages so I could rip menus etc. from sites.

In this example you specify the domain you are interested in, so you don't get off-site links, and it produces one CSV per document.

/**
 * Extracts links to the given domain from the files and creates CSVs of the links
 */


$LinkExtractor = new LinkExtractor('https://www.example.co.uk');

$LinkExtractor->extract(__DIR__ . '/hamburger.htm');
$LinkExtractor->extract(__DIR__ . '/navbar.htm');
$LinkExtractor->extract(__DIR__ . '/footer.htm');

class LinkExtractor {
    public $domain;

    public function __construct($domain) {
      $this->domain = $domain;
    }

    public function extract($file) {
        $html = file_get_contents($file);
        //Create a new DOM document
        $dom = new DOMDocument;

        //Parse the HTML. The @ suppresses the warnings that loadHTML()
        //emits when the markup in $html is malformed.
        @$dom->loadHTML($html);

        //Get all links. You could also use any other tag name here,
        //like 'img' or 'table', to extract other tags.
        $links = $dom->getElementsByTagName('a');

        $results = [];
        //Iterate over the extracted links and collect the matching ones
        foreach ($links as $link){
            //Extract and put the matching links in an array for the CSV
            $href = $link->getAttribute('href');
            $parts = parse_url($href);
            //Skip malformed URLs and links without a host (e.g. relative links)
            if (!empty($parts['path']) && !empty($parts['host']) && strpos($this->domain, $parts['host']) !== false) {
                $results[$parts['path']] = [$parts['path'], $link->nodeValue];
            }
        }
        }

        asort($results);
        // Make the CSV
        $fp = fopen($file .'.csv', 'w');
        foreach ($results as $fields) {
            fputcsv($fp, $fields);
        }
        fclose($fp);
    }
}
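
The strpos() check in the class accepts any host that is a substring of the domain string, so a stricter comparison may be worth considering. Below is a minimal sketch that compares the parsed hosts exactly. isOnDomain() is a hypothetical helper, and treating relative links as off-domain is an assumption you may want to change.

```php
<?php
// Sketch: decide whether a link belongs to the configured domain.
// isOnDomain() is a hypothetical helper, not part of the class above.
function isOnDomain(string $href, string $domain): bool
{
    $target = parse_url($domain, PHP_URL_HOST);
    $host   = parse_url($href, PHP_URL_HOST);

    // parse_url() yields null when the URL has no host (relative link)
    // and false when the URL is seriously malformed
    if (!is_string($host) || !is_string($target)) {
        return false;
    }
    // Hosts are case-insensitive
    return strcasecmp($host, $target) === 0;
}

var_dump(isOnDomain('https://www.example.co.uk/menu', 'https://www.example.co.uk'));
var_dump(isOnDomain('https://evil.example/menu', 'https://www.example.co.uk'));
var_dump(isOnDomain('/menu', 'https://www.example.co.uk'));
```

Inside extract(), such a helper could replace the strpos() condition, with $this->domain passed as the second argument.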