Question
I want to remove a specific pattern that starts with either \( or \\( and ends with \) or \\). There may or may not be a space before and after the pattern, i.e. the pattern can also be at the start or at the end of the string.
But the real problem is that there is very useful data inside a child pattern of the form \text { preserve this data }, and I want to preserve that data.
For example:
this is my text \( delete it x+y I do not care \text { Preserve this } whatever is here I do not care \text {preserve this also} \) this is outside text
So the result should be something like:
this is my text Preserve this preserve this also this is outside text
Basically, this is Mathpix Markdown and I want to remove all of it except the contents of \text. I can remove the \tags using
s = re.sub(r"\\[a-z]{3,}",' ',s)
and then use \\text {(.*?)\} to find the \text { asdas } parts (but I do not know how to recover or keep them),
but that creates a problem: apart from the tags, there is a lot of garbage data inside that I will not be able to identify later. I could run a loop that looks for \( or \\( and then an inner loop for \text {, but there can be any number of \text blocks, so that would be very hard to do.
I have some Java code that a friend suggested, but I do not know what the Python equivalent would be, and I have not tested it on corner cases. The Java code is something like:
Pattern.compile("(?=((\\\\text \\{)(.*?)(\\})))")
I'll really appreciate any help. I have little or no experience with groups and literally no idea about how to preserve inner things like this.
EDIT: A very typical example would be:
\( \begin{array}{ll}\text { Set A } & \text { Set B } \\ \text { 1. Adenine } & \text { a. } C_{5} N_{5} H_{5} O \\ \text { 2. Guanine } & \text { b. } C_{4} N_{2} H_{4} O_{2} \\ \text { 3. Uracil } & \text { c. } C_{5} N_{5} H_{5} \\ \text { 4. Thymine } & \text { d. } C_{5} N_{2} H_{6} O_{2}\end{array} \) \( \mathbf{A} \) \( 1-c ; 2-a ; 3-d ; 4-b \) B. \( 1-c ; 2-b ; 3-d ; 4-e \) c. \( 1-b ; 2-c ; 3-d ; 4-a \) D. \( 1-c ; 2-a ; 3-b ; 4-d \)
or
\( \begin{array}{ll}\text { 34. Climbing roots occur in } & \text { [APMEE 1996; CBSE PMT 1999] }\end{array} \)
or
\( \begin{array}{ll}\text { 21. Mesophyll is usually differentiated in } & \text { ICBSE'02] }\end{array} \)
Answer 1:
You can use
re.sub(r'\s*\\+\((.*?)\\+\)', lambda x: " ".join(re.findall(r'\\[a-z]{3,}\s*{([^{}]*)}', x.group(1))), s)
The first expression finds:
- \s* - zero or more whitespace characters
- \\+\( - one or more \ characters and then a (
- (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
- \\+\) - one or more \ characters and then a )
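To see what the outer expression captures on its own, here is a minimal sketch. The sample string and the names OUTER and sample are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Outer pattern from the answer: optional leading whitespace, one or more
# backslashes plus "(", a lazy body (Group 1), one or more backslashes plus ")".
OUTER = re.compile(r'\s*\\+\((.*?)\\+\)')

# Illustrative input with both the \( ... \) and the \\( ... \\) delimiter forms.
sample = r'before \( x+y \text { keep } \) middle \\( z \\) after'

# Group 1 holds everything between the opening and closing delimiters.
bodies = [m.group(1) for m in OUTER.finditer(sample)]
print(bodies)

# Replacing each match with an empty string removes the delimited spans,
# including the whitespace in front of them.
print(OUTER.sub('', sample))
```

Because \s* sits at the front of the pattern, the space before each opening delimiter is part of the match, while the space after the closing delimiter is left in place.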
The second expression finds the following pattern matches in the found Group 1 matches:
- \\ - a \ character
- [a-z]{3,} - three or more lowercase ASCII letters
- \s* - zero or more whitespace characters
- { - a { character
- ([^{}]*) - Group 1: zero or more characters other than { and }
- } - a } character
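The inner expression can be tried in isolation as well. This is only a sketch; the body string and the names INNER, body and captured are hypothetical and chosen to show which commands the pattern picks up.

```python
import re

# Inner pattern from the answer: a backslash, three or more lowercase letters,
# optional whitespace, then a brace pair with no nested braces (Group 1).
INNER = re.compile(r'\\[a-z]{3,}\s*{([^{}]*)}')

# Illustrative body, i.e. what Group 1 of the outer pattern might contain.
body = r' \begin{array}{ll}\text { Set A } & \mathbf{B} x_{1} '

# findall returns only the Group 1 captures, in order of appearance.
captured = INNER.findall(body)
print(captured)
```

The pattern is not specific to \text: any command of three or more lowercase letters followed by a brace group is captured, so \begin{array} and \mathbf{B} contribute their arguments too, while {ll} and x_{1} do not because no such command precedes their braces.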
All Group 1 matches found by the second expression are joined with a space, and the resulting string replaces the whole outer match in re.sub.
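The same replacement can also be written with a named callback instead of a lambda, which may be easier to read and test. This is a sketch under that assumption; the function names keep_braced_arguments and strip_math are invented here, and the behaviour is meant to be the same as the one-liner.

```python
import re

OUTER = re.compile(r'\s*\\+\((.*?)\\+\)')
INNER = re.compile(r'\\[a-z]{3,}\s*{([^{}]*)}')

def keep_braced_arguments(match):
    """Replacement callback: join every inner Group 1 capture with one space."""
    return " ".join(INNER.findall(match.group(1)))

def strip_math(text):
    """Replace each delimited span with the brace arguments found inside it."""
    return OUTER.sub(keep_braced_arguments, text)

s = r'this is my text \( delete it x+y I do not care \text { Preserve this } whatever is here I do not care \text {preserve this also} \) this is outside text'
result = strip_math(s)
print(result)
```

re.sub calls the callback once per outer match and inserts whatever string it returns, so the callback is free to run a second regex over match.group(1).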
See a Python demo:
import re
s = r'''this is my text \( delete it x+y I do not care \text { Preserve this } whatever is here I do not care \text {preserve this also} \) this is outside text'''
print( re.sub(r'\s*\\+\((.*?)\\+\)', lambda x: " ".join(re.findall(r'\\[a-z]{3,}\s*{([^{}]*)}', x.group(1))), s) )
# => this is my text Preserve this preserve this also this is outside text (runs of spaces collapsed here)
s = r'''\( \begin{array}{ll}\text { Set A } & \text { Set B } \\ \text { 1. Adenine } & \text { a. } C_{5} N_{5} H_{5} O \\ \text { 2. Guanine } & \text { b. } C_{4} N_{2} H_{4} O_{2} \\ \text { 3. Uracil } & \text { c. } C_{5} N_{5} H_{5} \\ \text { 4. Thymine } & \text { d. } C_{5} N_{2} H_{6} O_{2}\end{array} \) \( \mathbf{A} \) \( 1-c ; 2-a ; 3-d ; 4-b \) B. \( 1-c ; 2-b ; 3-d ; 4-e \) c. \( 1-b ; 2-c ; 3-d ; 4-a \) D. \( 1-c ; 2-a ; 3-b ; 4-d \)'''
print( re.sub(r'\s*\\+\((.*?)\\+\)', lambda x: " ".join(re.findall(r'\\[a-z]{3,}\s*{([^{}]*)}', x.group(1))), s) )
# => array Set A Set B 1. Adenine a. 2. Guanine b. 3. Uracil c. 4. Thymine d. arrayA B. c. D. (runs of spaces collapsed here)
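In the second output, the word "array" leaks in from \begin{array} and \end{array}, and "arrayA" is fused because the leading \s* removes the space before the delimiter. If only \text content is wanted, one possible variation is sketched below. It is an assumption about the desired result rather than part of the answer above, and the names TEXT_ONLY, keep_text_only and sample are hypothetical.

```python
import re

OUTER = re.compile(r'\s*\\+\((.*?)\\+\)')
# Restrict the inner pattern to \text only (assumption: nothing else is wanted).
TEXT_ONLY = re.compile(r'\\text\s*\{([^{}]*)\}')

def keep_text_only(text):
    def repl(match):
        # Strip each captured piece so the joined result has single spaces.
        kept = " ".join(part.strip() for part in TEXT_ONLY.findall(match.group(1)))
        # Pad with spaces so kept text never fuses with its neighbours.
        return f" {kept} " if kept else " "
    replaced = OUTER.sub(repl, text)
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join(replaced.split())

sample = r'\( \begin{array}{ll}\text { Set A } & \text { Set B }\end{array} \) \( \mathbf{A} \) \( 1-c \) B. \( 1-b \)'
print(keep_text_only(sample))
```

Note that this variation also drops the option label A that came from \mathbf{A}, and it collapses all whitespace in the input, including whitespace outside the delimiters. Whether either is acceptable depends on the data.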
Source: https://stackoverflow.com/questions/64787449/regex-to-remove-the-whole-outer-parent-pattern-but-still-preserving-the-data-ins