Match Partially Duplicated Lines

Question

I have rows in a list that are sometimes identical up to the first space character and then differ (e.g. a date follows).

wsmith jul/12/12
bwillis jul/13/13
wsmith jul/14/12
tcruise jul/12/12

I can easily sort the lines, but I'd love to remove the duplicate, later-dated entry. I did find a regex suggestion, but it only matches lines that are exactly the same. I need to be able to mark every row in the file that shares a username with another row. In my example above, lines 1 and 3 would be highlighted.

(edited for clarity)

Answer 1:

A compact pattern in Perl-compatible regex syntax (which Notepad++ supports) to check whether the start of one row is repeated in a later row would be

(?m)^(\S+).*\R(?s).*?\K\1

This will work in N++.

As you remove duplicate lines, more may become marked, because initially the regex skips over the in-between lines in order to highlight the duplicate.

Explanation

(?m) turns on multi-line mode, allowing ^ and $ to match on each line
The ^ anchor asserts that we are at the beginning of a line (because multi-line mode is on)
(\S+) captures non-space chars to Group 1
.* gets to the end of the line
\R line break
(?s) activates DOTALL mode, allowing the dot to match across lines
.*? lazily match chars up to ...
The \K tells the engine to drop what was matched so far from the final match it returns
\1 back-reference: match what Group 1 captured before.
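
To try the same idea outside Notepad++, here is a minimal Python sketch. Python's built-in re module supports neither \K nor \R, so this is an adaptation and not the pattern above: a second capture group stands in for \K, \n stands in for \R, and the scoped (?s:...) form replaces the mid-pattern (?s). The function name later_duplicates and the sample text are made up for illustration.

```python
import re

# Adaptation for Python's `re` module (an assumption, not the Notepad++ pattern):
# group 2 takes the place of \K, and \n takes the place of \R.
PATTERN = re.compile(r"^(\S+).*\n(?s:.*?)(\1)", re.MULTILINE)


def later_duplicates(text):
    """Return (username, line_number) for each repeated first field found."""
    hits = []
    for m in PATTERN.finditer(text):
        # 1-based line number of the spot where the back-reference matched.
        line_no = text.count("\n", 0, m.start(2)) + 1
        hits.append((m.group(1), line_no))
    return hits


sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)
print(later_duplicates(sample))  # [('wsmith', 3)]
```

As with the original pattern, each match consumes the lines between the first occurrence and the duplicate, so a single pass can miss duplicates that start inside that stretch. The back-reference is also not anchored to a line start, so a short username could match inside a longer word.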

Answer 2:

I propose this regex:

^(\S+) (?=(?s:.)*\1.*).*

It will mark the first line of each user that has a duplicate further down.

^          # Beginning of line
(\S+)      # Match and store non-spaces
           # One space
(?=        # Positive look-ahead begin
  (?s:.)*  # Match any character including newlines
  \1.*     # Match the matched group (i.e. the username) and anything following on same line
)          # End lookahead
.*         # Match anything remaining on line (mainly for the first match)
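
The breakdown above can be checked with a short Python sketch. The pattern is the one from this answer, unchanged; Python's re module (3.6 or later, for the scoped (?s:.) flag) is used here only as a stand-in for Notepad++, and the function name first_lines_with_duplicates and the sample text are illustrative assumptions.

```python
import re

# The answer's pattern as-is; MULTILINE makes ^ match at every line start.
FIRST_WITH_DUPLICATE = re.compile(r"^(\S+) (?=(?s:.)*\1.*).*", re.MULTILINE)


def first_lines_with_duplicates(text):
    """Return each line whose username shows up again later in the text."""
    return [m.group(0) for m in FIRST_WITH_DUPLICATE.finditer(text)]


sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)
print(first_lines_with_duplicates(sample))  # ['wsmith jul/12/12']
```

Only the earlier wsmith line is reported, because the look-ahead searches forward from the current line and the last occurrence has nothing after it to find.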

If Notepad++ marked all capture groups, you could use this variant to highlight all duplicates, including the last one:

^(\S+) (?=(?s:.)*(\1.*)).*

But unfortunately (at least for v6.5.2), N++ doesn't mark the capture groups.
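
In an engine that does expose capture groups, the second group can be read directly. The following is a rough Python sketch of that, using the capture-group variant of the pattern unchanged; Python's re module is only a stand-in for illustration, and the name first_and_last and the sample text are assumptions, not part of the original answer.

```python
import re

# Capture-group variant of the pattern: group 2 sits inside the look-ahead
# and records the last line that starts with the same username.
PAIR = re.compile(r"^(\S+) (?=(?s:.)*(\1.*)).*", re.MULTILINE)


def first_and_last(text):
    """Return (first_line, last_duplicate_line) pairs for repeated usernames."""
    return [(m.group(0), m.group(2)) for m in PAIR.finditer(text)]


sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)
print(first_and_last(sample))
# [('wsmith jul/12/12', 'wsmith jul/14/12')]
```

Because (?s:.)* is greedy, group 2 ends up on the last occurrence of the username, which is why the variant would cover the final duplicate as well.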

Source: https://stackoverflow.com/questions/24947409/match-partially-duplicated-lines

Tags

regex

duplicates

notepad++