change string in file between two strings with character X [duplicate]

余生分开走 2020-12-21 11:46

2 answers

伪装坚强ぢ (original poster)

2020-12-21 12:15
Provisional solution

This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there are one or more complete `<Name>…</Name>` entries and also a starting `<Name>` without the matching `</Name>` on the same line. Frankly, I'm not even sure how to start tackling that scenario.

The unhandled cases 3, 4, 7 are not valid XML, and I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full `<Name>…</Name>` version. I'm leaving that as an exercise for the reader (beware the `<` in the character class; it would need to become a `/`).

script.sed
```
/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
  # Handle up to 4 lines to the end-name tag
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
  /<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
  : l2
  s/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
  t l2
}
```
The first line 'skips' processing of lines not containing `<Name>` (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering', except that there's a `b` to jump to the end of processing.
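The effect of a bare `b` can be seen in isolation. This is only a minimal sketch, not part of the answer's script: the function name `show_skip` and the appended `<-- processed` marker are made up for illustration.

```shell
# Hypothetical demo: a line without <Name> branches to the end of the
# script, so the second command never sees it; it is simply printed.
show_skip() {
  sed -e '/<Name>/! b' -e 's/$/ <-- processed/'
}

printf '%s\n' 'to' '<Name>Jim</Name>' | show_skip
```

Only the line that contains `<Name>` should come out with the marker appended; the `to` line passes through untouched.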

The new section is the `/<Name>/,/<\/Name>/` code. This looks for `<Name>` on its own, and concatenates up to 4 lines until a `</Name>` is included in the pattern space. The two comment lines were used for debugging; they allowed me to see what was being treated as a unit. Except for the use of the label `l2` in place of `l1`, the remainder is exactly the same as in the initial offering; sed regexes already accommodate newlines.
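The debugging trick can be tried on its own. The sketch below assumes GNU sed and a three-line element of the kind described; `show_unit` is a hypothetical helper name. It joins lines the way the script does, then wraps the whole pattern space in `ZZ`…`AA` and prints it, so the extent of the joined unit is visible.

```shell
# Hypothetical demo: join up to 4 more lines until </Name> is present,
# then mark the start and end of the pattern space and print it.
show_unit() {
  sed -n -e '/<Name>/,/<\/Name>/{' \
         -e '/<\/Name>/! N' \
         -e '/<\/Name>/! N' \
         -e '/<\/Name>/! N' \
         -e '/<\/Name>/! N' \
         -e 's/^/ZZ/' \
         -e 's/$/AA/p' \
         -e '}'
}

printf '%s\n' '<Name>' '    Jim' '</Name>' | show_unit
```

Because `^` and `$` anchor to the pattern space as a whole here, `ZZ` lands before `<Name>` and `AA` after `</Name>`, with the embedded newlines left in place.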

This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.

data

A slightly extended data file.
```
 Jason 
Jim
 Jason Bourne 
 Elijah   Dennis 
 Elijah Wood   Dennis The Menace 
Elijah Wood Dennis The Menace
 Jason
        

    Jim

    Jim
        
 Jason
Bourne 
 
    Jason
        Bourne
            
 Elijah 

Dennis

 Elijah
Wood 
             Dennis
The Menace 
Elijah
Wood
    Dennis The
Menace



 Jason 
to
 XXXXX 

2. (see no space)

 Jim
 to
 XXX

3.

 
to 
`

4.


to


starting tag, value and closing tag can all come in different line

5.

Jim

to
XXX


6.


     Jim
       
to

     XXX
       

7.

  
to
  

8.

 Jason   Ignacio 
to
 XXXXX   XXXXXX 

9.

 Jason Ignacio 
to
 XXXXX XXXXXXX 
or
 XXXXXXXXXXXXX 
```
No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements are converted into XML comments. The mapping actually isn't crucial; the opening part doesn't match `<Name>` (and the tail doesn't match `</Name>`), so they'd not be processed anyway.

Output
```
$ sed -f script.sed data
 XXXXX 
XXX
 XXXXX XXXXXX 
 XXXXXX   XXXXXX 
 XXXXXX XXXX   XXXXXX XXX XXXXXX 
XXXXXX XXXX XXXXXX XXX XXXXXX
 XXXXX
        

    XXX

    XXX
        
 XXXXX
XXXXXX 
 
    XXXXX
        XXXXXX
            
 XXXXXX 

XXXXXX

 XXXXXX
XXXX 
             XXXXXX
XXX XXXXXX 
XXXXXX
XXXX
    XXXXXX XXX
XXXXXX



 XXXXX 
to
 XXXXX 

2. (see no space)

 XXX
 to
 XXX

3.

 
to 
`

4.


to


starting tag, value and closing tag can all come in different line

5.

XXX

to
XXX


6.


     XXX
       
to

     XXX
       

7.

  
to
  

8.

 XXXXX   XXXXXXX 
to
 XXXXX   XXXXXX 

9.

 XXXXX XXXXXXX 
to
 XXXXX XXXXXXX 
or
 XXXXXXXXXXXXX 
$
```
Initial offering

A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:

script.sed
```
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}
```
That is pretty contorted, to be polite about it. It looks for `<Name>` followed by zero or more spaces. That can be followed by `\(X[X[:space:]]*\)\{0,1\}`, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as `\1` in the replacement. Then there's a single character that isn't an X, `<` or space, followed by zero or more of any character, zero or more spaces, and `</Name>`; that tail is captured as `\3`. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label `: l1` and the conditional branch `t l1`. All that operates only on a line with both `<Name>` and `</Name>`.
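To see why the loop is needed, it may help to compare one pass of the substitution with the looped version. This is a sketch under the reading of the regex given above; the names `subst`, `mask_once` and `mask_all` are hypothetical.

```shell
# The substitution as described above, kept in a variable so that both
# variants share exactly the same regex.
subst='s/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/'

# One pass: only the first character that is not yet an X is replaced.
mask_once() { sed -e "$subst"; }

# Label plus conditional branch: repeat until the substitution fails.
mask_all() { sed -e ':l1' -e "$subst" -e 't l1'; }

printf '%s\n' '<Name> Jason </Name>' | mask_once   # <Name> Xason </Name>
printf '%s\n' '<Name> Jason </Name>' | mask_all    # <Name> XXXXX </Name>
```

Each iteration extends the run of X's captured in `\1` by one character, which is why the optional `\(X[X[:space:]]*\)` group has to be part of the pattern.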

data
```
 Jason 
Jim
 Jason Bourne 
 Elijah   Dennis 
 Elijah Wood   Dennis The Menace 
Elijah Wood Dennis The Menace
 Jason


Jim
 Jason
Bourne 
 Elijah   Dennis

 Elijah
Wood   Dennis
The Menace 
Elijah
Wood Dennis The
Menace
```
Output
```
$ sed -f script.sed data
 XXXXX 
XXX
 XXXXX XXXXXX 
 XXXXXX   XXXXXX 
 XXXXXX XXXX   XXXXXX XXX XXXXXX 
XXXXXX XXXX XXXXXX XXX XXXXXX
 Jason


Jim
 Jason
Bourne 
 XXXXXX   Dennis

 Elijah
Wood   Dennis
The Menace 
Elijah
Wood Dennis The
Menace
$
```
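Note the partial replacement in the second half of the output, where Elijah is masked but the Dennis that follows it on the same line is not. That line is going to cause headaches for anything more ambitious.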
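The troublesome line can be reproduced in isolation. The input below is an assumed example of such a line (a complete element followed by an opening tag whose closing tag is on the next line), and `mask_same_line` is a hypothetical wrapper around the script shown above.

```shell
# Hypothetical wrapper: the single-line script, passed as -e fragments.
mask_same_line() {
  sed -e '/<Name>.*<\/Name>/{' \
      -e ':l1' \
      -e 's/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/' \
      -e 't l1' \
      -e '}'
}

printf '%s\n' '<Name> Elijah </Name> <Name> Dennis' '</Name>' | mask_same_line
# <Name> XXXXXX </Name> <Name> Dennis
# </Name>
```

The first element is masked because its closing tag is on the same line; the second is left alone because no `</Name>` follows it within the pattern space.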

I've not worked out how the script would handle the various split-line cases, beyond the fact that it would almost certainly need to join lines until the `</Name>` is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.
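One possible way to join lines until the closing tag appears is a small `N` loop. This is not the approach taken in the provisional solution, which uses a fixed number of `N` commands; it is a sketch that assumes GNU sed, an input where each `<Name>` is eventually closed, and the hypothetical names `mask_joined`, `join` and `mask`.

```shell
# Hypothetical sketch: on a line with <Name>, keep appending the next
# line until </Name> is in the pattern space, then run the masking loop.
mask_joined() {
  sed -e '/<Name>/{' \
      -e ':join' \
      -e '/<\/Name>/!{' \
      -e 'N' \
      -e 'b join' \
      -e '}' \
      -e ':mask' \
      -e 's/\(<Name>[[:space:]]*\(X[X[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/' \
      -e 't mask' \
      -e '}'
}

printf '%s\n' '<Name> Jason' 'Bourne </Name>' | mask_joined
# <Name> XXXXX
# XXXXXX </Name>
```

The newline between the two words is skipped like any other whitespace, since `[[:space:]]` matches it and `[^X<[:space:]]` does not. This sketch does nothing about the harder case of a complete element followed by an unclosed `<Name>` on the same line beyond joining the lines together.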