Awk matching between two files when regions intersect (any solutions welcome)

Posted by 依然范特西╮ on 2019-12-10 23:24:22

Question


This builds on an earlier question, "Awk conditional filter one file based on another (or other solutions)".

Quick summary at bottom of question

I have an awk program that outputs a column from rows in a text file, 'refGene.txt', if values in that row match 2 out of 3 values in another text file.

I need to include an additional criterion for finding a match between the two files: a row should be included if the range given by the 2 numerical values in a row of file 1 overlaps with the range given by the two values in a row of refGene.txt. Example lines in file 1:

chr1 10 20
chr2 10 20

and an example line in file 2 (refGene.txt), showing only the matching columns ($3, $5, $6):

chr1 5 30

Currently the awk program does not treat this as a match because, although the first column matches, neither the 2nd nor the 3rd column does. But I would like a way to treat this as a match because the region 10-20 in file 1 is WITHIN the range 5-30 in refGene.txt. However, the second line in file 1 should NOT match because the first column does not match, which is required. If there is a way to include cases where any part of the range in file 1 overlaps with any part of the range in refGene.txt, that would be really helpful (so partial overlap also counts as a match). It should also replace the conditional statements below, as it would find all the cases they currently cover.

In summary, I want awk to print a match if $1 in file 1 matches $3 in file 2 AND the range $2-$3 in file 1 intersects at all with the range $5-$6 in file 2.
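
The intersection condition in the summary reduces to a single comparison: two ranges overlap when each one starts no later than the other ends. A minimal sketch, assuming inclusive integer coordinates with start <= end on both sides; the function name ranges_overlap is hypothetical and only here for illustration:

```shell
# Hypothetical helper: exit status 0 if [s1,e1] and [s2,e2] intersect.
# Assumes inclusive integer coordinates with start <= end.
ranges_overlap() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" \
        'BEGIN { exit !((s1 + 0 <= e2 + 0) && (s2 + 0 <= e1 + 0)) }'
}

ranges_overlap 10 20 5 30  && echo "overlap"     # file 1 range inside the refGene range
ranges_overlap 10 20 15 40 && echo "overlap"     # partial overlap
ranges_overlap 10 20 21 30 || echo "no overlap"  # disjoint ranges
```

The same test also covers the exact-match case, since identical ranges trivially intersect.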

Please let me know if my question is unclear. Any help is really appreciated, thanks in advance! (Solutions do not have to be in awk.)

Rubal

for f in /files/*txt; do
    awk '
        BEGIN {
            FS = "\t"
        }
        # First file: record each (chromosome, start, end) triple.
        FILENAME == ARGV[1] {
            pair[$1, $2, $3] = 1
            next
        }
        # refGene.txt: print column 13 when all three values match exactly.
        ($3, $5, $6) in pair {
            print $13
        }
    ' "$f" /files/refGene.txt > "/files/results/$(basename "$f")"
done

Answer 1:


You just need to use 2 arrays:

awk -F '\t' '
  NR == FNR {min[$1] = $2; max[$1] = $3; next}
  ($3 in min) && (min[$3] >= $5) && (max[$3] <= $6) {print $13}
'

NR == FNR is just another way to write FILENAME == ARGV[1]: it compares line numbers instead of filenames, and is true only while awk is reading the first file.
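
Note that the condition above matches only when the file 1 range lies entirely inside the refGene.txt range, and that keying the arrays on $1 alone keeps just the last range seen for each chromosome. A sketch of a variant that counts any overlap and stores every file 1 range, assuming tab-separated input, inclusive coordinates, and the column layout from the question; the wrapper name overlap_genes and the array names n, lo and hi are made up for illustration:

```shell
# Hypothetical wrapper: overlap_genes REGIONS_FILE REFGENE_FILE
overlap_genes() {
    awk -F '\t' '
        # First file: remember every range, grouped by chromosome.
        NR == FNR {
            k = ++n[$1]
            lo[$1, k] = $2
            hi[$1, k] = $3
            next
        }
        # Second file: print $13 once if [$5,$6] intersects any stored range.
        $3 in n {
            for (i = 1; i <= n[$3]; i++)
                if (lo[$3, i] + 0 <= $6 + 0 && $5 + 0 <= hi[$3, i] + 0) {
                    print $13
                    break
                }
        }
    ' "$1" "$2"
}
```

Inside the loop from the question this would be called as overlap_genes "$f" /files/refGene.txt, with the output redirected as before. The inner loop is linear in the number of file 1 ranges per chromosome, which is simple but may be slow for very large inputs.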



Source: https://stackoverflow.com/questions/12730800/awk-matching-between-two-files-when-regions-intersect-any-solutions-welcome
