regex to pull all attributes out of all meta tags

问题

I'm trying to pull meta tags out of a html page, to compare two pages (live and dev) to see if they're SEO is the same after a site redesign/refactor. I need to compare title, meta tags (description, opengraph etc.), h1's, our analytics (Omniture), and our ad tags (doubleclick) are all the same.

My problem is getting meta tags http://php.net/manual/en/function.get-meta-tags.php only works if they have a name= attribute, same with "mariano at cricava dot com"'s solution.

I don't want to restrict it to having certain attributes, I could make the assumption that all our meta tags have either a name=, or property= or http-equiv= and change the regex appropriately but cannot be entirely sure as it's a massive website and any random crap could be in the tags (hence this tool is to check this stuff!) and would like to leave it as dynamic as possible.

I have

$page = @file_get_contents('http://.../');
preg_match_all('#<meta(?:\s+?([^\=]+)\=\"(.+?)\")+?\s*?/?>#sui', $page, $matches, PREG_SET_ORDER)

but the subpatterns override each other, so this only pulls out the last attribute-name=attribute-value pair

Array
(
    [0] => Array
        (
            [0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            [1] => content
            [2] => text/html; charset=UTF-8
        )

    [1] => Array
        (
            [0] => <meta name="description" content="some description" />
            [1] => content
            [2] => some description
        )

    [2] => Array
        (
            [0] => <meta property="og:type" content="website" />
            [1] => content
            [2] => website
        )
...

I need all the attributes for all the meta tags. I could do this in two steps, pulling the contents of <meta ([^>]*)> then doing a second regular expression on the results, but that seems unnecessary with the power of regex?

回答1:

But back to the original question, forget it's HTML for now, is there no way to have recurring subpatterns return in preg_match_all rather than just returning the last match?

Not possible with preg_*/PCRE (nor any other regex flavor that I know of, but in Perl you could use a (?{ push @list, $^N }) hack).

回答2:

 preg_match_all("<meta\\s*(?:(?:\\b(\\w|-)+\\b\\s*(?:=\\s*(?:[\"\"[^\"\"]*\"\"|'[^']*'|
   [^\"\"'<> ]|[''[^'']*''|\"[^\"]*\"|[^''\"<> ]]]+)\\s*)?)*)/?\\s*>", $content, $meta);

try with this

回答3:

I am doing it this way. First pull out the meta tags with the following regex

string regex = "<meta\\s(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>";

I found the regex over here -

RegEx match open tags except XHTML self-contained tags

Then pull out attributes using another regex, which would be quite simple to write.

来源：https://stackoverflow.com/questions/6723278/regex-to-pull-all-attributes-out-of-all-meta-tags

标签

php

regex

preg-match-all

meta-tags