Deleting duplicate values using find and replace in a text editor

Question

I messed something up. In my XML, each non-preferred term has a preferred term to use. Something I did has created some non-preferred terms where the preferred term to use has exactly the same name as the non-preferred term itself.

<term>
<termId>127699289611384833453kNgWuDxZEK37Lo4QVWZ</termId>
<termUpdate>Add</termUpdate>
<termName>Adenosquamous Carcinoma</termName>
<termType>Nd</termType>
<termStatus>Active</termStatus>
<termApproval>Approved</termApproval>
<termCreatedDate>20110704T09:41:31</termCreatedDate>
<termCreatedBy>admin</termCreatedBy>
<termModifiedDate>20110704T09:45:17</termModifiedDate>
<termModifiedBy>admin</termModifiedBy>
<relation>
  <relationType>USE</relationType>
  <termId>1276992897N1537166632rbr7BISWAI93SarY118G</termId>
  <termName>Adenosquamous Carcinoma</termName>
</relation>

Is there a text editor with a find and replace function that I can tell: if the <termName> in the <relation> equals the <termName> of the actual term, just delete the whole <term>? I looked at the related questions and they mentioned regular expressions, but I've spent ages trying to build them and they are beyond me. Thanks!

Answer 1:

This answer comes nearly 3 years late, but Perl regular expressions can indeed be used for this task.

A term block whose relation contains the same termName as the term itself can be found and deleted with UltraEdit for Windows v21.10.0.1032, and most likely also with other text editors that support Perl regular expressions. Run a case-sensitive Perl regular expression Replace with this search string:

^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n

The replace string is an empty string.
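
For readers who want to try the search string outside an editor, here is a minimal Python sketch. It assumes that Python's re module accepts the same syntax (it does support the lookahead and the back reference used here) and adds re.MULTILINE so that ^ matches at the start of every line, as it does in the editor. The shortened sample data and the function name remove_self_referencing_terms are made up for illustration.

```python
import re

# The search string from above, unchanged. re.MULTILINE is an assumption added
# here so that ^ matches at the start of every line, not only at the start of the text.
PATTERN = re.compile(
    r"^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>"
    r"(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+"
    r"<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n",
    re.MULTILINE,
)

# Hypothetical, shortened sample: the first block points at itself,
# the second block points at a different preferred term.
SAMPLE = """<terms>
<term>
<termName>Adenosquamous Carcinoma</termName>
<relation>
  <relationType>USE</relationType>
  <termName>Adenosquamous Carcinoma</termName>
</relation>
</term>
<term>
<termName>Oat Cell Carcinoma</termName>
<relation>
  <relationType>USE</relationType>
  <termName>Small Cell Carcinoma</termName>
</relation>
</term>
</terms>
"""


def remove_self_referencing_terms(xml_text: str) -> str:
    """Delete every term block whose relation names the term itself."""
    return PATTERN.sub("", xml_text)


cleaned = remove_self_referencing_terms(SAMPLE)
print(cleaned)
```

Only the first block should disappear; the second one stays because its two names differ. On very large files a regex with this much backtracking may be slow, so it is worth testing on a copy first.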

Explanation:

^ ... start every search at beginning of a line.

[ \t]* ... there can be 0 or more spaces or tabs at beginning of the line.

<term> ... this string must be found next on the line.

The tricky expression comes next. It must match any character up to the next string of interest, but it must not run on into the next term block when the rest of the expression cannot be matched within the current term block.

(?:(?!</term>)[\S\s])+ ... this expression matches any character, because [\S\s] matches any non-whitespace character or any whitespace character. Because of the +, there must be at least 1 character before the next fixed string, but there can be more. In addition, the negative lookahead (?!</term>) is evaluated before each character is matched to check that the string </term> does NOT start at the current position. If </term> starts at the current position, the regexp engine stops matching characters there and continues with the next part of the search string. So this expression can match any character, but never beyond </term>, and therefore only characters between <term> and </term>. Because of ?: nothing is captured/marked for back referencing by this expression.
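
The effect of the lookahead guard can be seen in a small Python sketch. The two tiny blocks and the word "target" are made-up test data, and Python's re module is assumed to behave like the Perl engine for this construct.

```python
import re

# Two tiny hypothetical blocks; only the second one contains the word "target".
text = "<term>a</term>\n<term>b target</term>\n"

# Without the guard, [\S\s]+ is free to run past the first </term>.
unbounded = re.match(r"<term>[\S\s]+target", text)

# With the guard, matching from the first <term> cannot leave the first block,
# so no match is possible when anchored at the start of the text.
bounded_first = re.match(r"<term>(?:(?!</term>)[\S\s])+target", text)

# Searching the whole text finds the second block only.
bounded_any = re.search(r"<term>(?:(?!</term>)[\S\s])+target", text)

print(repr(unbounded.group(0)))
print(bounded_first)
print(repr(bounded_any.group(0)))
```

The unguarded pattern swallows both blocks, while the guarded one either fails on the first block or matches entirely inside the second.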

<termName> ... this fixed string within a term block must be found next.

([^\r\n]+?) ... matches the characters of the term name and captures/marks this string for back referencing. Instead of the negative character class [^\r\n], another class definition could be used, or just . if the dot does not match newline characters. ([^<]+) would also work, provided that an unencoded opening angle bracket cannot be part of the term name. According to the XML specification, the character < must be encoded as &lt; within an element's value, except within a CDATA block.

</termName> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

<relation> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

<termName> ... this fixed string within a term block must be found next.

\1 ... this expression back references the captured/marked term name and therefore the next string must be the same as the name of the term defined above.
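
How the capture and the back reference work together can be checked in isolation. The following Python sketch uses a deliberately simplified pattern (not the full search string) and two made-up fragments; the names same_name, self_ref and other_ref are only for illustration.

```python
import re

# Simplified pattern: capture the first name, then require the same text again via \1.
same_name = re.compile(
    r"<termName>([^<]+)</termName>[\S\s]*?<termName>\1</termName>"
)

# Hypothetical fragments: one relation repeats the name, the other does not.
self_ref = "<termName>A</termName>\n<relation><termName>A</termName></relation>"
other_ref = "<termName>A</termName>\n<relation><termName>B</termName></relation>"

hit = same_name.search(self_ref)
miss = same_name.search(other_ref)

print(hit.group(1))
print(miss)
```

The back reference compares text literally, so names that differ only in letter case or surrounding whitespace are treated as different.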

</termName> ... this fixed string within a term block must be found next.

(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.

</term> ... this fixed string marking end of a term block must be found next.

[ \t\r]*\n ... matches 0 or more spaces, tabs and carriage returns and next a line-feed. So this expression works for a DOS/Windows (CR+LF) and a Unix (only LF) text file.
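
The line-ending part can be tried on its own. This Python sketch assumes the same semantics as in the editor and uses made-up one-line inputs. It also shows one consequence that follows from the expression: a last line without a terminating line-feed is not matched.

```python
import re

# Only the tail of the search string: the closing tag plus the rest of its line.
tail = re.compile(r"</term>[ \t\r]*\n")

crlf = tail.search("</term>\r\n")        # DOS/Windows line ending
lf = tail.search("</term>  \n")          # Unix line ending after trailing spaces
no_newline = tail.search("</term>")      # last line of a file without a line-feed

print(crlf is not None, lf is not None, no_newline is None)
```

If the last term block in a file can be a duplicate, making sure the file ends with a line break before running the replace avoids missing it.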

Also possible with UltraEdit is:

(?s)^[ \t]*<term>(?:(?!</term>).)+<termName>([^<]+?)</termName>(?:(?!</term>).)+<relation>(?:(?!</term>).)+<termName>\1</termName>(?:(?!</term>).)+</term>[ \t\r]*\n

(?s) ... this expression at beginning of the search string changes the behavior of . from matching any character except line terminators to really any character and therefore . is now like [\S\s].
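
The difference made by (?s) can be reproduced with Python's re module, which supports the same inline flag at the start of a pattern. The block below is a made-up example and the variable names are only for illustration.

```python
import re

# A hypothetical block spanning several lines.
text = "<term>\nline one\nline two\n</term>"

# Without (?s) the dot stops at line breaks, so the block cannot be crossed.
without_flag = re.search(r"<term>(?:(?!</term>).)+</term>", text)

# With (?s) the dot matches line breaks too and behaves like [\S\s].
with_flag = re.search(r"(?s)<term>(?:(?!</term>).)+</term>", text)
with_class = re.search(r"<term>(?:(?!</term>)[\S\s])+</term>", text)

print(without_flag)
print(with_flag.group(0) == with_class.group(0))
```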

Source: https://stackoverflow.com/questions/6595295/deleting-duplicate-values-using-find-and-replace-in-a-text-editor

Tags

text

duplicates

editor