Question
I'm trying to pull meta tags out of an HTML page so I can compare two pages (live and dev) and check that their SEO is the same after a site redesign/refactor. I need to verify that the title, meta tags (description, Open Graph etc.), h1's, our analytics (Omniture), and our ad tags (DoubleClick) are all the same.
My problem is getting the meta tags: get_meta_tags (http://php.net/manual/en/function.get-meta-tags.php) only works if they have a name= attribute, and the same goes for "mariano at cricava dot com"'s solution.
I don't want to restrict it to certain attributes. I could assume that all our meta tags have either a name=, property= or http-equiv= attribute and change the regex accordingly, but I cannot be entirely sure: it's a massive website and any random crap could be in the tags (hence this tool to check this stuff!), so I would like to leave it as dynamic as possible.
I have:
$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER);
but the subpatterns override each other, so this only pulls out the last attribute-name=attribute-value pair:
Array
(
[0] => Array
(
[0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
[1] => content
[2] => text/html; charset=UTF-8
)
[1] => Array
(
[0] => <meta name="description" content="some description" />
[1] => content
[2] => some description
)
[2] => Array
(
[0] => <meta property="og:type" content="website" />
[1] => content
[2] => website
)
...
I need all the attributes for all the meta tags. I could do this in two steps, pulling out the contents of <meta ([^>]*)> and then running a second regular expression on the results, but that seems unnecessary given the power of regex?
Answer 1:
But back to the original question, forget it's HTML for now, is there no way to have recurring subpatterns return in preg_match_all rather than just returning the last match?
Not possible with preg_*/PCRE (nor any other regex flavor that I know of, but in Perl you could use a (?{ push @list, $^N }) hack).
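To make the limitation concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not from the question. It shows that a capture group inside a repeated group keeps only its final iteration, while matching one pair per match lets preg_match_all return every pair:

```php
<?php
// Hypothetical sample input, not taken from the question.
$input = 'a="1" b="2" c="3"';

// Capture groups inside a repeated group: each iteration overwrites
// the previous one, so only the last pair survives.
preg_match('/(?:\s*(\w+)="([^"]*)")+/', $input, $single);
// $single[1] is 'c' and $single[2] is '3'; 'a' and 'b' are gone.

// Dropping the outer repetition and letting preg_match_all iterate
// returns one match per pair instead.
preg_match_all('/(\w+)="([^"]*)"/', $input, $pairs, PREG_SET_ORDER);
foreach ($pairs as $pair) {
    echo $pair[1], ' => ', $pair[2], PHP_EOL;
}
```

This is why a single pattern that repeats the attribute group inside the tag cannot report every attribute on its own.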
Answer 2:
preg_match_all('#<meta(?:\s+[\w:-]+(?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^"\'<>\s]+))?)*\s*/?>#i', $content, $meta);
Try this.
Answer 3:
I am doing it this way. First pull out the meta tags with the following regex:
string regex = "<meta\\s(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>";
I found the regex here: RegEx match open tags except XHTML self-contained tags
Then pull out the attributes using another regex, which would be quite simple to write.
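As a rough sketch of that two-step approach in PHP: the sample markup, the variable names and the second (attribute) pattern below are my own assumptions and are not part of the answer. The first pattern is the tag regex from above, ported to a PHP string.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$page = <<<'HTML'
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="description" content="some description" />
<meta property='og:type' content=website>
</head>
HTML;

// Step 1: grab each <meta ...> tag; quoted values may contain ">".
$tagPattern = '#<meta\s(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>#i';
preg_match_all($tagPattern, $page, $tags);

// Step 2 (assumed pattern): name = value, where the value is double-quoted,
// single-quoted or unquoted. The branch reset (?|...) puts the value in
// group 2 whichever alternative matched.
$attrPattern = '#([^\s=<>"\'/]+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+))#';

$result = [];
foreach ($tags[0] as $tag) {
    $attrs = [];
    preg_match_all($attrPattern, $tag, $found, PREG_SET_ORDER);
    foreach ($found as $attr) {
        $attrs[$attr[1]] = $attr[2];  // attribute name => attribute value
    }
    $result[] = $attrs;
}

print_r($result);
```

Each entry of $result is then one meta tag as a name => value array, which can be compared between the live and dev pages without assuming which attributes are present. Valueless attributes and unquoted values directly followed by "/>" are not handled by this sketch.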
Source: https://stackoverflow.com/questions/6723278/regex-to-pull-all-attributes-out-of-all-meta-tags