Question
I need help with a regex that will find a link that appears in different formats depending on how it was inserted into the HTML page.
I can read the pages into PHP. I just can't come up with the right regex to find the URLs and isolate them.
I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they are wrapped in anchor tags. On the odd occasion, text that is not part of the link follows it without any spacing.
Article ID and Article Key are never the same. Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks
Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392">http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392</a>
<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392">This is a link description</a>
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
In the end I am just looking for the URL.
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736
Answer 1:
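This regex works for me:
/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g
UPDATE:
I added a \d at the end of the regex, so the match always ends on a digit and trailing text such as "this is not part of the url" is left out.
/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g
PHP has no /g modifier, so to use it in PHP you need /.../msi (with preg_match_all() to collect every match).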
PHP Example in action: http://ideone.com/N0TKM
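As a rough sketch of how the updated pattern might be applied in PHP: the function name extractArticleUrls and the sample text are made up for illustration, and the pattern uses ~ as the delimiter only to avoid escaping every slash. It assumes the href values may contain either & or &amp;.

```php
<?php
// Hypothetical helper (not from the answer): returns the unique article URLs in $html.
function extractArticleUrls(string $html): array
{
    // Same idea as the updated regex above; the trailing \d forces the match to end on a digit.
    $pattern = '~http://(www\.)?example\.com/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\w-]*)\d~msi';

    // preg_match_all() plays the role of the /g flag; $matches[0] holds the full matches.
    preg_match_all($pattern, $html, $matches);

    // Turn &amp; back into & and drop duplicates (e.g. the same URL in href and link text).
    $decoded = array_map('html_entity_decode', $matches[0]);
    return array_values(array_unique($decoded));
}

// Sample input modelled on the formats listed in the question.
$sample = <<<'TEXT'
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
TEXT;

print_r(extractArticleUrls($sample));
```

One caveat with this approach: the pattern trims trailing text only back to the last digit it can reach, so if the glued-on text itself contains a digit before the next space, that digit and everything before it would be kept as part of the URL.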
Answer 2:
DO NOT USE A REGEX to pick apart the HTML! Use an XML parser...
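$dom = new DOMDocument();
// loadHTMLFile() is an instance method; calling it statically fails on PHP 8.
$dom->loadHTMLFile($pathToFile);
$finder = new DOMXPath($dom);
// Select every anchor that has an href attribute.
$anchors = $finder->query('//a[@href]');
foreach ($anchors as $anchor) {
    $href = $anchor->getAttribute('href');
    if (preg_match($regexToMatchUrls, $href)) {
        // do stuff
    }
}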
So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. You can then take action when a match occurs.
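Here is one possible way to fill in $regexToMatchUrls and run the parser on a string instead of a file. Everything in it is an assumption for illustration: the function name findArticleUrls, the exact pattern, and the choice to also scan text nodes (the question mentions plain-text links, which an href-only query would miss).

```php
<?php
// Hypothetical helper: collects article URLs from <a href> values and from plain text nodes.
function findArticleUrls(string $html): array
{
    // Assumed pattern. The parser has already decoded &amp;, so a bare & is enough here.
    $regexToMatchUrls = '~http://(www\.)?example\.com/ArticleDetails\.aspx\?ArticleID=.*?&AidKey=[\w-]*\d~i';

    // Collect parser warnings internally instead of emitting them for sloppy HTML.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($dom);
    $urls = [];
    // Both href attributes and text nodes expose their content through nodeValue.
    foreach ($finder->query('//a/@href | //text()') as $node) {
        if (preg_match_all($regexToMatchUrls, $node->nodeValue, $m)) {
            foreach ($m[0] as $url) {
                $urls[$url] = true; // array keys de-duplicate repeated URLs
            }
        }
    }
    return array_keys($urls);
}

$page = '<p>http://example.com/ArticleDetails.aspx?ArticleID=10975137&amp;AidKey=701321736this is not part of the url.</p>'
      . '<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>';

print_r(findArticleUrls($page));
```

The regex still does the trimming of trailing text, but it only ever sees attribute values and text content, never the tags themselves.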
Source: https://stackoverflow.com/questions/6948901/php-preg-match-to-find-and-locate-a-dynamic-url-from-html-pages