Match specific pattern and print just the matched string in the previous line: Updated

Submitted by 若如初见 on 2020-01-25 08:48:05

Question


I have a .fastq file formatted in the following way:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (name)
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG (sequence)
+ 
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG (quality)

Every read has the same format (a repeating block of 4 lines). What I am trying to do is search for a specific regex pattern within a window of n=35 characters of the 2nd line, cut it out if found, and report it at the end of the previous line.

So far, Dr. Norton has shared a nice script that searches for the regexp, extracts it, and reports it in the header of the read (1st line). Unfortunately, I am not able to extract a string if it is located at the end of the line, since the indexes (in particular RSTART) are wrong.

The code below does the job when searching for FtgtRegexp ([A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}):

BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    winLgth   = 35
    numLines  = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( match(substr(rec[2],1,winLgth),FtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}

input:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG 
+ 
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG 

output:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CATCTACATATTCACATATAG
ACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG 
+ 
GGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG

If I slightly modify the script to extract a second regexp, "RtgtRegexp", located at the end of the second line, it gives the wrong output because the RSTART reported for the match is wrong:

BEGIN {
    FtgtRegexp = "[A-Z]{5}ACA[A-Z]{5}ACA[A-Z]{5}"
    RtgtRegexp = "[A-Z]{5}TGT[A-Z]{5}TGT[A-Z]{5}"
    winLgth   = 35
    numLines  = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( match(substr(rec[2],(length(rec[2])-winLgth+1),winLgth),RtgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}

input:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG 
+ 
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG 

desired output:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 CAGTATGTAGGACTGTAACAT
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCT 
+ 
GGACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGF 

actual output:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 ATTCACATATAGACATGAAAC
ACCTGTGGTTCTTCCTCAGTATGTAGGACTGTAACATAG 
+ 
GGGGGGFGGGGFGFGFFFGGGGGGFGGGGGGGGGGGFGG
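
A likely explanation for the wrong index: match() sets RSTART relative to the string it is given, which here is the 35-character window rather than rec[2]. The sketch below is one possible way to handle that, not a definitive fix. It adds the window's starting position to RSTART before cutting, and keeps only the part of the read before the match, as in the desired output. The names trim_tail_motif, rep, any5, winStart and absStart are made up for illustration. Stripping trailing blanks and building the pattern without {5} (some awk builds may not support interval expressions) are assumptions.

```shell
#!/bin/sh
# Sketch: cut a motif found in the LAST winLgth characters of each read.
# trim_tail_motif is a hypothetical name; it reads FASTQ records on stdin.
trim_tail_motif() {
    awk '
    # Repeat s n times, so the pattern does not depend on {5} interval
    # expressions (assumption: not every awk build supports them).
    function rep(s, n,    out) { while (n-- > 0) out = out s; return out }

    BEGIN {
        any5       = rep("[A-Z]", 5)
        RtgtRegexp = any5 "TGT" any5 "TGT" any5
        winLgth    = 35
        numLines   = 4
    }
    {
        sub(/[ \t\r]+$/, "")    # assumption: trailing blanks are not data
        lineNr = ( (NR-1) % numLines ) + 1
        rec[lineNr] = $0
    }
    lineNr == numLines {
        # First character of the window, clamped for reads shorter than winLgth.
        winStart = length(rec[2]) - winLgth + 1
        if ( winStart < 1 ) winStart = 1
        if ( match(substr(rec[2], winStart), RtgtRegexp) ) {
            # RSTART counts from the start of the window, so shift it
            # to a position inside the full sequence.
            absStart = winStart + RSTART - 1
            rec[1] = rec[1] " " substr(rec[2], absStart, RLENGTH)
            rec[2] = substr(rec[2], 1, absStart - 1)
            rec[4] = substr(rec[4], 1, absStart - 1)
        }
        for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
            print rec[lineNr]
        }
    }'
}
```

It could be called as trim_tail_motif < reads.fastq, where reads.fastq is a placeholder file name. The first script needs no such shift because its window starts at position 1, so the window index and the index in rec[2] coincide.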

Source: https://stackoverflow.com/questions/58707588/match-specific-pattern-and-print-just-the-matched-string-in-the-previous-line-u
