This question occurs repeatedly in many forms with many different multi-character delimiters and so IMHO is worth a canonical answer.
Given an input file like:
My solution (in its current version) works through the input from the front, so the output order is not exactly the same:
.. 1 # second
.. a<2 .. .. # first in my approach
.. @{<>}@ # fourth
.. 4 .. .. # third
.. 5 .. # fifth
If the program traversed the arrays arr and seps backwards, the output would (probably) be the same, but I have run out of time for now.
This is written for GNU awk (it uses split with four parameters to parse the data).
EDIT: For compatibility with awks other than GNU awk I added the function gsplit(), which is a crude replacement for GNU awk's split.
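To make the arr/seps layout concrete: the four-parameter split stores the text pieces in arr and the matched separators in seps, so that seps[i] sits between arr[i] and arr[i+1]. The following is only a minimal sketch of that layout, using the same match()-based loop idea so that it runs in any awk; the helper name show_split and the sample string are made up for illustration.

```shell
# show_split is a hypothetical helper: it prints each piece of the string
# together with the separator that follows it, the way arr and seps are filled.
show_split() {
  awk -v str="$1" -v pat="$2" 'BEGIN {
    i = 0
    while (match(str, pat)) {          # find the next separator
      i++
      printf "arr[%d]=[%s] seps[%d]=[%s]\n", i, substr(str, 1, RSTART - 1), i, substr(str, RSTART, RLENGTH)
      str = substr(str, RSTART + RLENGTH)   # continue after the separator
    }
    # the last piece has no separator after it
    printf "arr[%d]=[%s]\n", i + 1, str
  }'
}

show_split 'a<foo>b</foo>c' '</?foo>'
# prints:
# arr[1]=[a] seps[1]=[<foo>]
# arr[2]=[b] seps[2]=[</foo>]
# arr[3]=[c]
```

Note that there is always one separator fewer than there are pieces, which matters when looping over seps.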
$ cat program.awk
{ data=data $0 }                          # append all records to one var
END {
    n=gsplit(data, arr, "</?foo>", seps)  # split by every tag
    for(i=1;i<n;i++) {                    # atm iterate front to back; seps has only n-1 entries
        if(seps[i]=="<foo>")              # if element opening tag
            stack[++j]=seps[i] arr[i+1]   # store tag and wait for closing tag
        else {                            # closing tag
            # right after another closing tag, arr[i] is the tail of the outer element
            stack[j]=stack[j] (seps[i]==prev ? arr[i] : "")
            print stack[j--] seps[i]
        }
        prev = seps[i]
    }
}
# elementary GNU awk split compatible replacement
function gsplit(str, arr, pat, seps,    i) {
    delete arr; delete seps; i=0
    while(match(str, pat)) {
        arr[++i]=substr(str,1,(RSTART-1))
        seps[i]=substr(str,RSTART,RLENGTH)
        str=substr(str,(RSTART+RLENGTH))
    }
    arr[++i]=substr(str,1)
    return i
}
Run it:
$ awk -f program.awk file
<foo> .. a<2 .. </foo>
<foo> .. 1 .. </foo>
<foo> .. 4 .. </foo>
<foo> .. @{<>}@ .. </foo>
<foo> .. 5 .. </foo>