Provisional solution
This extends the 'initial offering' below and handles cases 1, 2, 5, 6, 8, 9. It does not handle the case where there is one or more complete <Name>…</Name> entries and also a starting <Name> without the matching </Name> on the same line. Frankly, I'm not even sure how to start tackling that scenario.
The unhandled cases 3, 4, 7 are not valid XML, and I'm not convinced they're valid HTML (or XHTML) either. I believe they can be handled by a similar (but simpler) mechanism to the one shown here for the full <Name>…</Name> version. I'm leaving that as an exercise for the reader (beware the < in the character class; it would need to become a /).
script.sed
/<Name>/! b
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
b
}
/<Name>/,/<\/Name>/{
# Handle up to 4 lines to the end-name tag
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
/<\/Name>/! N
# s/^/ZZ/; s/$/AA/p
# s/^ZZ//; s/AA$//
: l2
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l2
}
The first line 'skips' processing of lines not containing <Name> (they get printed and the next line is read). The next 6 lines are the script from the 'initial offering' except that there's a b command to jump to the end of processing.
The new section is the /<Name>/,/<\/Name>/ code. This looks for <Name> on its own, and appends up to 4 more lines until a </Name> is included in the pattern space. The two comment lines were used for debugging; they allowed me to see what was being treated as a unit. Except for the use of the label l2 in place of l1, the remainder is exactly the same as in the initial offering, because sed regexes already accommodate newlines.
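The joining step can be tried in isolation. The following is a minimal sketch, not part of script.sed: the three-line input and the shell variable names (input, joined, masked) are made up for illustration, and only two N commands are used because the sample needs no more. The l command prints the pattern space with \n for each embedded newline, which serves the same purpose as the two debugging comment lines.

```shell
#!/bin/sh
# Hypothetical three-line element, similar in shape to the split-line cases.
input=$(printf '%s\n' '<Name>' 'Jason' 'Bourne</Name>')

# Join only: 'l' shows what is being treated as a unit.
joined=$(printf '%s\n' "$input" |
    sed -n -e '/<Name>/,/<\/Name>/{' \
           -e '/<\/Name>/! N' \
           -e '/<\/Name>/! N' \
           -e 'l' \
           -e '}')

# Join, then run the same substitution loop over the multi-line pattern space.
masked=$(printf '%s\n' "$input" |
    sed -e '/<Name>/,/<\/Name>/{' \
        -e '/<\/Name>/! N' \
        -e '/<\/Name>/! N' \
        -e ': l2' \
        -e 's/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/' \
        -e 't l2' \
        -e '}')

printf '%s\n' "$joined" "$masked"
# <Name>\nJason\nBourne</Name>$
# <Name>
# XXXXX
# XXXXXX</Name>
```

The newlines survive the substitution because [[:space:]] matches them and the character being replaced is never whitespace.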
This is heavy-duty sed scripting and not what I'd want to use or maintain. I would go with a Perl solution using an XML parser (because I know Perl better than Python), but Python would do the job fine too with an appropriate XML parser.
data
A slightly extended data file.
Jason
Jim
Jason Bourne
Elijah Dennis
Elijah Wood Dennis The Menace
Elijah Wood Dennis The Menace
Jason
Jim
Jim
Jason
Bourne
Jason
Bourne
Elijah
Dennis
Elijah
Wood
Dennis
The Menace
Elijah
Wood
Dennis The
Menace
Jason
to
XXXXX
2. (see no space)
Jim
to
XXX
3.
to
`
4.
to
starting tag, value and closing tag can all come in different line
5.
Jim
to
XXX
6.
Jim
to
XXX
7.
to
8.
Jason Ignacio
to
XXXXX XXXXXX
9.
Jason Ignacio
to
XXXXX XXXXXXX
or
XXXXXXXXXXXXX
No claims are made that the data file contains a minimal set of cases; it is repetitious. It includes the material from the question, except that the 'unorthodox' XML elements are converted into XML comments. The mapping actually isn't crucial; the opening part doesn't match <Name> (and the tail doesn't match </Name>) so they'd not be processed anyway.
Output
$ sed -f script.sed data
XXXXX
XXX
XXXXX XXXXXX
XXXXXX XXXXXX
XXXXXX XXXX XXXXXX XXX XXXXXX
XXXXXX XXXX XXXXXX XXX XXXXXX
XXXXX
XXX
XXX
XXXXX
XXXXXX
XXXXX
XXXXXX
XXXXXX
XXXXXX
XXXXXX
XXXX
XXXXXX
XXX XXXXXX
XXXXXX
XXXX
XXXXXX XXX
XXXXXX
XXXXX
to
XXXXX
2. (see no space)
XXX
to
XXX
3.
to
`
4.
to
starting tag, value and closing tag can all come in different line
5.
XXX
to
XXX
6.
XXX
to
XXX
7.
to
8.
XXXXX XXXXXXX
to
XXXXX XXXXXX
9.
XXXXX XXXXXXX
to
XXXXX XXXXXXX
or
XXXXXXXXXXXXX
$
Initial offering
A partial answer — but it illustrates the problems you face. Dealing with cases 1 & 2 in the question, plus the multi-word variations, you can use the script:
script.sed
/<Name>.*<\/Name>/{
: l1
s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/
t l1
}
That is pretty contorted, to be polite about it. It looks for <Name> followed by zero or more spaces. That can be followed by \(X[X[[:space:]]*\)\{0,1\}, which means zero or one occurrences of an X followed by a sequence of X's or spaces. All of that is captured as \1 for use in the replacement. Then there's a single character that isn't an X, < or space, followed by zero or more of any character, zero or more spaces, and </Name>; that tail is captured as \3. The single character in the middle is replaced by an X. The whole replacement is repeated until there are no more matches via the label : l1 and the conditional branch t l1. All that operates only on a line with both <Name> and </Name>.
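To see the loop at work, the substitution can be applied one pass at a time. This is a minimal sketch rather than part of the script: the sample line follows the form used in the data file, and the shell variable names (line, subst, step1, step2, full) are made up for illustration.

```shell
#!/bin/sh
# Sample line of the form used in the data file.
line='<Name>Jason Bourne</Name>'

# The substitution from script.sed, held in a variable so it can be reused.
subst='s/\(<Name>[[:space:]]*\(X[X[[:space:]]*\)\{0,1\}\)[^X<[:space:]]\(.*[[:space:]]*<\/Name>\)/\1X\3/'

# A single pass replaces only the first character that is not yet an X.
step1=$(printf '%s\n' "$line" | sed -e "$subst")
step2=$(printf '%s\n' "$step1" | sed -e "$subst")

# With the label and the conditional branch, passes repeat until nothing matches.
full=$(printf '%s\n' "$line" | sed -e ': l1' -e "$subst" -e 't l1')

printf '%s\n' "$step1" "$step2" "$full"
# <Name>Xason Bourne</Name>
# <Name>XXson Bourne</Name>
# <Name>XXXXX XXXXXX</Name>
```

Each pass moves the boundary between the X's already written and the untouched text one character to the right, which is why spaces inside the name are preserved.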
data
Jason
Jim
Jason Bourne
Elijah Dennis
Elijah Wood Dennis The Menace
Elijah Wood Dennis The Menace
Jason
Jim
Jason
Bourne
Elijah Dennis
Elijah
Wood Dennis
The Menace
Elijah
Wood Dennis The
Menace
Output
$ sed -f script.sed data
XXXXX
XXX
XXXXX XXXXXX
XXXXXX XXXXXX
XXXXXX XXXX XXXXXX XXX XXXXXX
XXXXXX XXXX XXXXXX XXX XXXXXX
Jason
Jim
Jason
Bourne
XXXXXX Dennis
Elijah
Wood Dennis
The Menace
Elijah
Wood Dennis The
Menace
$
Note the partial replacement part way through the multi-line cases near the end (the line XXXXXX Dennis). That line is going to cause headaches for anything more ambitious.
I've not worked out how the script would handle the various split-line cases, beyond noting that it would almost certainly need to join lines until the </Name> is caught. It would then do processing closely related to that already shown, but it would need to allow for newlines in the matched material.
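One way to join lines without fixing the number in advance is a loop around N. What follows is only a sketch under assumptions: the label name join, the variable joined and the sample input are made up, and it has not been run against the full data file. It prints each joined unit with l so the embedded newlines are visible; the masking substitution would take the place of that l.

```shell
#!/bin/sh
# Hypothetical input: an element spread over seven lines, with ordinary lines around it.
joined=$(printf '%s\n' 'before' '<Name>' 'Elijah' 'Wood' 'Dennis' 'The' 'Menace' '</Name>' 'after' |
    sed -n -e ': join' \
           -e '/<Name>/{' \
           -e '/<\/Name>/! {' \
           -e '$! {' \
           -e 'N' \
           -e 'b join' \
           -e '}' \
           -e '}' \
           -e 'l' \
           -e '}')

printf '%s\n' "$joined"
# <Name>\nElijah\nWood\nDennis\nThe\nMenace\n</Name>$
```

The $! guard stops the loop at the last input line, so an unclosed <Name> does not make N run off the end of the file.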