Match Partially Duplicated Lines

冷眼眸甩不掉的悲伤 submitted on 2019-12-25 07:06:02

Question


I have rows in a list that are sometimes identical up to the first space character and then differ (e.g. a date follows the username).

wsmith jul/12/12
bwillis jul/13/13
wsmith jul/14/12
tcruise jul/12/12

I can easily sort the lines, but I'd love to remove the later-dated duplicate entry. I did find a regex suggestion, but it only matches lines that are exactly the same. I need to be able to mark every row in the file that shares a username with another row. In my example above, lines 1 and 3 would be highlighted.

(edited for clarity)


Answer 1:


A compact pattern for the PCRE engine (used by Notepad++) that checks whether the start of one row is repeated on a later row would be:

(?m)^(\S+).*\R(?s).*?\K\1

This will work in N++.

As you remove duplicate lines, more may become marked, because initially the regex skips over the in-between lines in order to highlight the duplicate.

Explanation

  • (?m) turns on multi-line mode, allowing ^ and $ to match on each line
  • The ^ anchor asserts that we are at the beginning of a line (because multi-line mode is on)
  • (\S+) captures non-space chars to Group 1
  • .* gets to the end of the line
  • \R line break
  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • .*? lazily matches chars up to ...
  • The \K tells the engine to drop what was matched so far from the final match it returns
  • \1 back-reference: match what Group 1 captured before.
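
For readers who want to try the idea outside Notepad++, here is a minimal Python sketch. It is an adaptation, not the same pattern: Python's built-in re module supports neither \K nor \R, so \n stands in for \R (assuming Unix line endings) and a second capture group stands in for \K. The function name and the sample text are illustrative only.

```python
import re

# Adapted pattern: \n replaces \R and group 2 replaces \K, because
# Python's re module has neither. (?s:...) limits DOTALL to the lazy skip.
PATTERN = re.compile(r"^(\S+).*\n(?s:.*?)(\1)", re.MULTILINE)

def repeated_name_lines(text):
    """Return 1-based line numbers where a repeated leading token is found."""
    found = []
    for m in PATTERN.finditer(text):
        # Newlines before the start of group 2 give its line number.
        found.append(text.count("\n", 0, m.start(2)) + 1)
    return found

sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)
print(repeated_name_lines(sample))  # [3]
```

As with the original pattern, matches do not overlap, so a single pass can skip over in-between lines, and the back-reference is not anchored to the start of a line.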



Answer 2:


I propose this regex:

^(\S+) (?=(?s:.)*\1.*).*

It will mark each line whose username appears again later in the file, i.e. every occurrence except the last one.



^          # Beginning of line
(\S+)      # Match and store non-spaces
           # One space
(?=        # Positive look-ahead begin
  (?s:.)*  # Match any character including newlines
  \1.*     # Match the matched group (i.e. the username) and anything following on same line
)          # End lookahead
.*         # Match anything remaining on line (mainly for the first match)
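
This pattern can also be tried in Python, whose re module accepts it unchanged (scoped flags such as (?s:.) need Python 3.6 or later). The sketch below is only an illustration on the question's sample data; the variable names are made up.

```python
import re

# The pattern from this answer, unchanged. MULTILINE lets ^ match at
# the start of every line instead of only at the start of the text.
FIRST_WITH_DUPLICATE = re.compile(r"^(\S+) (?=(?s:.)*\1.*).*", re.MULTILINE)

sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)

# Each match is a whole line whose username occurs again further down.
marked = [m.group(0) for m in FIRST_WITH_DUPLICATE.finditer(sample)]
print(marked)  # ['wsmith jul/12/12']
```

The last wsmith line is not matched, because nothing after it repeats the name.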

If Notepad++ marked all capture groups, you could use this to highlight all duplicates, including the last one:

^(\S+) (?=(?s:.)*(\1.*)).*


But unfortunately (at least for v6.5.2), N++ doesn't mark the capture groups.
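
In an environment that does expose capture groups, the second group can be read directly. The following Python sketch (an assumption about how one might use the pattern outside Notepad++, with a hypothetical helper name) pairs each earlier line with the line captured by group 2.

```python
import re

# Variant with a second capture group around the later occurrence.
PAIR = re.compile(r"^(\S+) (?=(?s:.)*(\1.*)).*", re.MULTILINE)

def duplicate_pairs(text):
    """Return (earlier line, later line) tuples for repeated usernames."""
    # group(0) is the matched earlier line; group(2) is set inside the
    # lookahead and holds the later line starting with the same name.
    return [(m.group(0), m.group(2)) for m in PAIR.finditer(text)]

sample = (
    "wsmith jul/12/12\n"
    "bwillis jul/13/13\n"
    "wsmith jul/14/12\n"
    "tcruise jul/12/12\n"
)
print(duplicate_pairs(sample))
# [('wsmith jul/12/12', 'wsmith jul/14/12')]
```

Because (?s:.)* is greedy, group 2 holds the last occurrence of the name, so with three or more repeats every earlier line is paired with the final one.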



Source: https://stackoverflow.com/questions/24947409/match-partially-duplicated-lines
