Google Analytics Regex - Alternative to no negative lookahead

Question

Google Analytics no longer allows negative lookahead in its filters. This makes it very difficult to create a custom report that includes only the links I want.

The regex I would use if negative lookahead were still supported is:

test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

This matches:

test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23 
test.com/?ref=23&e=35

and does not match (as it should):

test.com/ambassadors
test.com/admin/?signup=true 
test.com/randomtext/

I am looking to find out how to adapt my regex to still hold the same matches but without the use of negative lookahead.

Thank you!

Answer 1:

Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.

That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
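A quick way to convince yourself of the equivalence is to run both forms over a few URLs. The sketch below uses Python's re module as a stand-in (an assumption; it is not the engine Google Analytics uses) and a deliberately shortened, hypothetical pattern, with neither DOTALL nor MULTILINE switched on:

```python
import re

# Hypothetical, shortened pattern: only the ending differs between the two.
with_lookahead = re.compile(r"test\.com/?(?!.)")
with_anchor = re.compile(r"test\.com/?$")

# URLs taken from the question; none of them contains a newline.
urls = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php?ref=23",
    "test.com/ambassadors",
]

# For each URL, record whether each form finds a match.
results = [
    (url, bool(with_lookahead.search(url)), bool(with_anchor.search(url)))
    for url in urls
]

for url, a, b in results:
    print(f"{url!r}: (?!.) -> {a}, $ -> {b}")
```

On newline-free input like this, the two columns agree for every URL.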

However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:

test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)

...which I'm pretty sure you don't want. :P
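To see the over-matching for yourself, here is a small check in Python (again only a stand-in for the Google Analytics engine). The pattern is the one from the question, with the unnecessary backslash before the underscore dropped:

```python
import re

# The question's pattern; `index\_` is written as `index_` here.
ORIGINAL = re.compile(r"test.com(\/\??index_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)")

junk = [
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
    "test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)",
]

# True for each string that the original pattern accepts.
junk_accepted = [ORIGINAL.search(s) is not None for s in junk]

# The URLs the question wants rejected are still rejected.
rejected_ok = [
    ORIGINAL.search(s) is None
    for s in ["test.com/ambassadors", "test.com/admin/?signup=true", "test.com/randomtext/"]
]
```

Both junk strings are accepted because (.*) happily swallows whatever follows the question mark or the .php.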

Try this regex:

test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$

or more readably:

test\.com
(?:
  /
  (?:index_\w+\.php)?
  (?:
    \?ref=\d+
    (?:
      &e=\d+
    )?
  )?
)?
\s*$
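If you want to try the spaced-out version directly, Python's re.VERBOSE flag ignores the whitespace in the pattern. This is only a sketch for testing against the sample URLs from the question; Google Analytics itself needs the one-line form:

```python
import re

# Same regex as above, laid out with re.VERBOSE so whitespace is ignored.
ANSWER1 = re.compile(r"""
    test\.com
    (?:
      /
      (?:index_\w+\.php)?
      (?:
        \?ref=\d+
        (?:
          &e=\d+
        )?
      )?
    )?
    \s*$
""", re.VERBOSE)

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",          # trailing space, covered by \s*
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
]

matched = [ANSWER1.search(u) is not None for u in should_match]
rejected = [ANSWER1.search(u) is None for u in should_not_match]
```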

For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:

^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
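Here is the same kind of check for the path-only version, assuming (and it is only an assumption) that the filter sees the request path without the domain, so the question's URLs are listed below with test.com stripped off:

```python
import re

PATH_ONLY = re.compile(r"^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$")

# Paths derived from the question's examples by removing the "test.com" prefix.
wanted = [
    "/",
    "/index_fb2.php",
    "/index_fb2.php?ref=23",
    "/index_fb2.php?ref=23&e=35",
    "/?ref=23",
    "/?ref=23&e=35",
]
unwanted = [
    "/ambassadors",
    "/admin/?signup=true",
    "/randomtext/",
]

wanted_ok = [PATH_ONLY.search(p) is not None for p in wanted]
unwanted_ok = [PATH_ONLY.search(p) is None for p in unwanted]
```

Note that this version requires the leading slash and no longer tolerates trailing whitespace.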

Answer 2:

Firstly I think your regex needs some fixing. Let's look at what you have:

test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

The case where you use the optional ? at the start of index... is already taken care of by the second alternative:

test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:

test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)

Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:

test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)

Now the first and second and third option can be collapsed into one, if we make the file name optional, too:

test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)

Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)

Seeing your input examples, this seems to be closer to what you actually want to match.
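The effect of these steps can be checked with a short Python sketch (Python's re is just a convenient stand-in here, not the Google Analytics engine). BEFORE is the pattern after the first simplification above, AFTER is the final one:

```python
import re

BEFORE = re.compile(r"test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)")
AFTER = re.compile(r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)")

# Strings mentioned above that should not be accepted.
loose = [
    "test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat",
    "test.com/index_fb2php",
    "test.com/index_fb2.html?someparam=php",
]
before_hits = [BEFORE.search(s) is not None for s in loose]  # accepted by the loose pattern
after_hits = [AFTER.search(s) is not None for s in loose]    # rejected by the tightened one

# The question's examples still behave as intended with the final pattern.
good = [
    "test.com", "test.com/", "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23", "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ", "test.com/?ref=23&e=35",
]
bad = ["test.com/ambassadors", "test.com/admin/?signup=true", "test.com/randomtext/"]
good_ok = [AFTER.search(s) is not None for s in good]
bad_ok = [AFTER.search(s) is None for s in bad]
```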

Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it by \Z, which always matches the end of the string. If you do not, then it asserts that you have reached the end of a line. In this case you can use $ but you need to also use multi-line mode, so that $ matches line-endings, too.

So, if you use singleline mode (which probably means you have only one URL per string), use this:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z

If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
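As a rough illustration of the two cases, here is how they look in Python, where \Z anchors at the very end of the string and re.MULTILINE makes $ match before each newline as well. Other flavors may treat these anchors slightly differently, and the multi-line log string is a made-up example:

```python
import re

BODY = r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*"

# One URL per string: anchor at the end of the whole string.
single = re.compile(BODY + r"\Z")

# Several URLs, one per line: $ must also match at line endings.
multi = re.compile(BODY + r"$", re.MULTILINE)

# Hypothetical input with one URL per line.
log = "test.com/ambassadors\ntest.com/?ref=23\ntest.com/admin/"

single_hit = single.search("test.com/?ref=23")
per_line_hits = [m.group(0) for m in multi.finditer(log)]
print(per_line_hits)
```

With the multi-line input, only the middle line is reported; the end-of-string variant finds nothing in that same input because no matching URL sits at the very end.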

Source: https://stackoverflow.com/questions/13361680/google-analytics-regex-alternative-to-no-negative-lookahead

Tags

regex

google-analytics