Question
I have a .fastq file formatted in the following way:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (name)
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG (sequence)
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG (quality)
For each sequence the format is the same (a repetition of 4 lines). What I am trying to do is search for a specific regex pattern within a window of n=35 characters of the 2nd line, cut it out if found, and append it to the end of the previous line (the header).
So far, Dr. Norton shared a nice script that searches for the regexp, extracts it, and appends it to the header of the read (1st line). Unfortunately, I am not able to extract a string if it is located at the end of the line, since the indexes (in particular RSTART) are wrong.
The code below does the job when searching for FtgtRegexp ([A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}):
BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    winLgth = 35     # length of the search window
    numLines = 4     # lines per FASTQ record
}
{
    # buffer the current line under its position (1-4) within the record
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    # look for the pattern in the first winLgth characters of the sequence
    if ( match(substr(rec[2],1,winLgth),FtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CATCTACATATTCACATATAG
ACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
If I slightly modify the script to extract a second regexp, RtgtRegexp, located at the end of the second line, it gives me the wrong output, because it reports the wrong RSTART for the match:
BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    RtgtRegexp = "[A-Z]{5}TGT[A-Z]{5}TGT[A-Z]{5}"
    winLgth = 35     # length of the search window
    numLines = 4     # lines per FASTQ record
}
{
    # buffer the current line under its position (1-4) within the record
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    # look for the pattern in the last winLgth characters of the sequence
    if ( match(substr(rec[2],(length(rec[2])-winLgth+1),winLgth),RtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
desired output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CAGTATGTAGGACTGTAACAT
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCT
+
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGF
actual output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 ATTCACATATAGACATGAAAC
ACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG
+
GGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
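A possible direction, offered only as a sketch: match() sets RSTART relative to the string it is given, which here is the 35-character window rather than the whole read, so the offset probably has to be shifted by the window's starting position before it is used on rec[2] and rec[4]. The names trim_tail, rep, winStart and absStart below are made up for illustration and are not part of the original script. The pattern is assembled by repetition, on the assumption that some awk builds do not accept {5} intervals; it is meant to be equivalent to RtgtRegexp above. The sketch also assumes the part of the read before the match should be kept and everything from the match onward dropped, as in the desired output.

```shell
# Hypothetical wrapper: reads FASTQ from the given files (or stdin)
# and prints the trimmed records to stdout.
trim_tail() {
  awk '
  # Hypothetical helper: repeat string s n times (avoids {n} intervals).
  function rep(s, n,    out) { while (n-- > 0) out = out s; return out }
  BEGIN {
      any5 = rep("[A-Z]", 5)
      RtgtRegexp = any5 "TGT" any5 "TGT" any5
      winLgth = 35
      numLines = 4
  }
  {
      lineNr = ( (NR-1) % numLines ) + 1
      rec[lineNr] = $0
  }
  lineNr == numLines {
      # first character of the window, in full-line coordinates
      winStart = length(rec[2]) - winLgth + 1
      if ( winStart < 1 ) winStart = 1       # read shorter than the window
      if ( match(substr(rec[2], winStart), RtgtRegexp) ) {
          # RSTART counts from the start of the window, so shift it back
          absStart = winStart + RSTART - 1
          rec[1] = rec[1] " " substr(rec[2], absStart, RLENGTH)
          # keep only what precedes the match, in sequence and quality
          rec[2] = substr(rec[2], 1, absStart - 1)
          rec[4] = substr(rec[4], 1, absStart - 1)
      }
      for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
          print rec[lineNr]
      }
  }
  ' "$@"
}
```

It could be called as trim_tail reads.fastq > trimmed.fastq, where both file names are placeholders. For the sample read, the window starts at character 37 and the match is found at RSTART=13 inside it, which corresponds to character 49 of the full line.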
Source: https://stackoverflow.com/questions/58707588/match-specific-pattern-and-print-just-the-matched-string-in-the-previous-line-u