PHP preg_match to find and locate a dynamic URL from HTML Pages

Submitted by 吃可爱长大的小学妹 on 2019-12-12 01:27:50

Question


I need help with a regex that will find a link that appears in different formats depending on how it was inserted into the HTML page.

I am able to read the pages into PHP. I just can't come up with the right regex to find the URLs and isolate them.

I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they have anchor tags wrapped around them. There is even the odd occasion where text that is not part of the link gets appended without a space.

The Article ID and Article Key are never the same. The Article Key, however, always ends with a digit. If this is possible, I could really use the help. Thanks.

Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941

http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566    

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392</a>

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.

In the end I am just looking for the URL.

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736

Answer 1:


This regex works for me:

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g

UPDATE: I added a \d at the end of the regex so that the match always ends on a digit.

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g

PHP has no /g flag. To use it in PHP, call preg_match_all() and use the modifiers /.../msi instead.

PHP Example in action: http://ideone.com/N0TKM
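
As a minimal sketch of how the updated pattern might be used from PHP, assuming the page has already been read into a string: the helper name findArticleUrls and the sample $html are made up for illustration and are not from the answer. It uses ~ as the delimiter so the slashes need no escaping, and preg_match_all() in place of /g.

```php
<?php
// Hypothetical helper (not from the answer): returns every ArticleDetails URL in $html.
function findArticleUrls(string $html): array
{
    // Same idea as the updated regex: the trailing \d forces the match to end
    // on a digit, which drops letters glued onto the end of the AidKey.
    $pattern = '~http://(?:www\.)?example\.com/ArticleDetails\.aspx'
             . '\?ArticleID=.*?(?:&amp;|&)AidKey=[\w-]*\d~i';

    preg_match_all($pattern, $html, $matches);

    // Normalize &amp; to & so both spellings of a link compare equal, then de-duplicate.
    $urls = str_replace('&amp;', '&', $matches[0]);
    return array_values(array_unique($urls));
}

// Made-up sample input built from two of the question's examples.
$html = '<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>'
      . ' http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.';

print_r(findArticleUrls($html));
```

One limitation worth keeping in mind: if the glued-on text itself contains a digit with no space before it, the match will run through to that digit, because the pattern cannot tell it apart from the key.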




Answer 2:


DO NOT USE A REGEX! Use an XML parser...

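libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$dom = new DOMDocument();
$dom->loadHTMLFile($pathToFile);  // instance method; the static call fails on PHP 8
$finder = new DOMXPath($dom);
$anchors = $finder->query('//a[@href]');

foreach ($anchors as $anchor) {
  $href = $anchor->getAttribute('href');
  if (preg_match($regexToMatchUrls, $href)) {
    // do stuff
  }
}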

So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. You can then take action when a match occurs.
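
For illustration, here is one way $regexToMatchUrls might look, applied to a small made-up HTML string via loadHTML() rather than a file. The DOM parser has already decoded &amp; to &, so the pattern only needs to handle a plain &. The pass over text nodes is an addition that is not part of the answer; it is included because href attributes alone would miss the plain-text links from the question. The sample HTML and variable names are assumptions.

```php
<?php
// Made-up sample page; in practice the HTML would come from the file being scanned.
$html = '<html><body>'
      . '<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>'
      . '<a href="http://example.com/other.aspx">unrelated</a>'
      . '<p>http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.</p>'
      . '</body></html>';

// One possible pattern (an assumption, not the answer's): URL only, ending on a digit.
$regexToMatchUrls = '~http://(?:www\.)?example\.com/ArticleDetails\.aspx'
                  . '\?ArticleID=[^&\s]+&AidKey=[\w-]*\d~i';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$found = [];

// Look at href attribute values and at text nodes, since either may hold a URL.
foreach ($finder->query('//a/@href | //text()') as $node) {
    if (preg_match_all($regexToMatchUrls, $node->nodeValue, $m)) {
        foreach ($m[0] as $url) {
            $found[$url] = true; // array keys de-duplicate repeated links
        }
    }
}

$urls = array_keys($found);
print_r($urls);
```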



Source: https://stackoverflow.com/questions/6948901/php-preg-match-to-find-and-locate-a-dynamic-url-from-html-pages
