Question
I run a PHP script in WordPress that removes the http: and https: protocols from all the links using the following regex:
$links = preg_replace( '/<input\b[^<]*\bvalue=[\"\']https?:\/\/(*SKIP)(*F)|https?:\/\//', '//', $links );
The first part, <input\b[^<]*\bvalue=[\"\']https?:\/\/(*SKIP)(*F), skips any <input> tag whose value starts with http:// or https://, such as:
<input type="url" value="http://example.com">
Additionally, I'd like it to skip any <link> tags that have a rel="canonical" attribute:
<link rel="canonical" href="http://example.com/remove-http/" />
Using a regex tester, I've been trying to update the logic. This is what I've come up with so far:
<(input|link)\b[^<]*\(value|rel)=[\"\'](https?:\/\/|canonical)(*SKIP)(*F)|https?:\/\/
But this hasn't worked for me.
Answer 1:
The (*SKIP)(*F) verbs discard the text matched so far and resume the search from the position the regex engine had reached when it hit the verbs, so the discarded text can never become part of a later match.
So, to match word1 or word2, drop them and go on to look for word3, you need to use
'~(?:word1|word2)(*SKIP)(*F)|word3~'
The (?:...) non-capturing group will group the alternatives that must be dropped.
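As a minimal sketch of that behaviour, the snippet below uses made-up stand-ins (cat-word, dog-word and word are illustrative, not from the question). Without the verbs, the plain alternative would also hit the "word" inside "cat-word" and "dog-word".

```php
<?php
// Illustrative input: the placeholder words are assumptions for this demo.
$text = 'cat-word word dog-word word';

// "cat-word" and "dog-word" are matched first and then thrown away by
// (*SKIP)(*F), so the "word" inside them can no longer be matched.
// Only the standalone "word" occurrences are replaced.
$result = preg_replace('~(?:cat-word|dog-word)(*SKIP)(*F)|word~', 'X', $text);

echo $result, PHP_EOL; // cat-word X dog-word X
```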
In your case, the whole <link...> tag should be matched, not just the part up to the attribute. Thus, you need something like link\b[^>]*?\brel=[\'\"]canonical[\'\"][^>]*> instead of word2 in the above regex.
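Combining the two skipped alternatives for the question's case might look like the sketch below. The sample markup in $links is illustrative, and the leading < before link is an assumption added here to anchor the match at the start of the tag.

```php
<?php
// Sample markup based on the snippets in the question (illustrative only).
$links = '<input type="url" value="http://example.com">' . "\n"
       . '<link rel="canonical" href="http://example.com/remove-http/" />' . "\n"
       . '<a href="https://example.com/page">page</a>';

// Using ~ as the delimiter avoids having to escape the slashes.
$pattern = '~(?:'
    . '<input\b[^<]*\bvalue=[\'"]https?://'           // <input ... value="http(s)://
    . '|'
    . '<link\b[^>]*?\brel=[\'"]canonical[\'"][^>]*>'  // whole <link rel="canonical" ...> tag
    . ')(*SKIP)(*F)|https?://~';

// Both skipped alternatives keep their protocol; every other one becomes "//".
$links = preg_replace($pattern, '//', $links);

echo $links, PHP_EOL;
```

With this input, only the href of the <a> tag loses its protocol. The pattern still assumes quoted attribute values and no > inside an attribute, so it can misfire on unusual markup.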
However, you should think about using an HTML parser that is compatible with your environment (I saw your note that the DOMDocument malfunctions there).
Answer 2:
You should consider using the built-in PHP DOM class.
http://php.net/manual/en/book.dom.php
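A rough sketch of what a DOM-based version could look like, assuming the DOM extension works in your environment. The helper name strip_protocols_dom and the choice to touch only href and src attributes are assumptions made for illustration. Unlike the regex, it does not rewrite URLs that appear in plain text.

```php
<?php
// Hypothetical helper: the name and the attribute list are assumptions.
function strip_protocols_dom(string $html): string
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Elements carrying href or src, except <link rel="canonical">.
    // <input value="..."> is never touched because value is not rewritten.
    $query = '//*[@href or @src][not(self::link and @rel="canonical")]';

    foreach ($xpath->query($query) as $element) {
        foreach (['href', 'src'] as $name) {
            if ($element->hasAttribute($name)) {
                $url = preg_replace('~^https?://~i', '//', $element->getAttribute($name));
                $element->setAttribute($name, $url);
            }
        }
    }

    return $doc->saveHTML();
}

// Illustrative document built from the snippets in the question.
$html = '<html><head><link rel="canonical" href="http://example.com/remove-http/"></head>'
      . '<body><input type="url" value="http://example.com">'
      . '<a href="https://example.com/page">page</a></body></html>';

$out = strip_protocols_dom($html);
echo $out;
```

Note that saveHTML() serializes the whole document, so the output may differ from the input in doctype and formatting.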
HTML is a very rich language, and regular expressions are not powerful enough to parse it reliably. Please never parse HTML using regex.
Parsing HTML with regex will drive SO users insane, as this answer shows: https://stackoverflow.com/a/1732454/5909136
Source: https://stackoverflow.com/questions/43066317/php-regex-skip-link-tags-when-rel-canonical