How to compare 2 lists of ranges in bash?

Question

Using a bash script (Ubuntu 16.04), I'm trying to compare two lists of ranges: does any number in any range in file1 coincide with any number in any range in file2? If so, print the matching row of the second file. Each range is stored as two tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.

file1:

1 4
5 7
8 11
12 15

file2:

3 4 
8 13 
20 24

Desired output:

3 4 
8 13

My best attempt has been:

awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next}; 
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then            
{print $1, $2}}}}}' file1 file2 > output.txt

This returns an empty file.

I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.

Any help appreciated!

Answer 1:

It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:

declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges

# read ranges in second file, store them in arrays min and max
while read a b; do
    min+=( "$a" );
    max+=( "$b" );
done < file2

# read ranges in first file    
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}" # print range
            unset 'min[i]' 'max[i]'    # performance optimization (quoted to avoid globbing)
        fi
    done
done < file1

This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
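
The overlap test used in the loop above can be isolated in a small function, which makes it easier to try out on its own. This is only a minimal sketch, assuming inclusive integer bounds; the function name ranges_overlap is made up for illustration and is not part of the solution above.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for illustration only).
# Succeeds (exit status 0) if the inclusive ranges [lo1, hi1] and
# [lo2, hi2] share at least one integer, and fails otherwise.
ranges_overlap() {
    local lo1=$1 hi1=$2 lo2=$3 hi2=$4
    (( lo1 <= hi2 && lo2 <= hi1 ))
}

ranges_overlap 1 4 3 4  && echo "1-4 and 3-4 overlap"
ranges_overlap 5 7 8 13 || echo "5-7 and 8-13 do not overlap"
```

The single condition also covers the case where one range lies completely inside the other, for example 0-100 and 3-4.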

EDIT 1: improved the range overlap test.

EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.

EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:

awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
    { for (i in a)
          if ($1 <= b[i] && a[i] <= $2) {
              print a[i], b[i]; delete a[i]; delete b[i];
          } 
    }' file2 file1
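
To check the awk version against the sample data, something like the following can be used. It is only a sketch: the temporary directory and the variable names tmp and result are my own choices, and file1 is assumed to hold the four ranges 1-4, 5-7, 8-11 and 12-15 that are used as test data elsewhere on this page.

```shell
#!/bin/bash
# Recreate the sample inputs in a temporary directory.
# Assumption: file1 contains the ranges 1-4, 5-7, 8-11 and 12-15.
tmp=$(mktemp -d)
printf '1\t4\n5\t7\n8\t11\n12\t15\n' > "$tmp/file1"
printf '3\t4\n8\t13\n20\t24\n'       > "$tmp/file2"

# Same awk program as above: load file2 into the arrays a and b,
# then scan them for every line of file1.
result=$(awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next }
    { for (i in a)
          if ($1 <= b[i] && a[i] <= $2) {
              print a[i], b[i]; delete a[i]; delete b[i]
          }
    }' "$tmp/file2" "$tmp/file1")

echo "$result"   # prints "3 4" and "8 13" on separate lines
rm -rf "$tmp"
```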

Answer 2:

awk solution:

awk 'NR==FNR{ a[$1]=$2; next }
     { for(i in a) 
           if ($1<=a[i] && i+0<=$2) {   # overlap test; also catches a file2 range lying inside the file1 range
               print i,a[i]; delete a[i];
           } 
     }' file2 file1

The output:

3 4
8 13

Answer 3:

awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2

Breakdown:

FNR == 1 && NR == 1 {
    file = 1
}
FNR == 1 && NR != 1 {
    file = 2
}
file == 1 {
    for (q = 1; q <= NF; q++) {
        nums[$q] = $0
    }
}
file == 2 {
    for (p = 1; p <= NF; p++) {
        for (i in nums) {
            if (i == $p) {
                print $0
            }
        }
    }
}

In short, file is set to 1 while the first file is being processed and to 2 while the second one is. For each line of the first file, the whole line is stored in the array nums under each of its fields as a key. For each line of the second file, every field is compared against the keys of nums, and the line is printed whenever a field equals one of them.

Answer 4:

This one requires GNU awk, because it controls the scanning order of the for loop to save time:

$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_desc"
}
NR==FNR {                                         # hash file1 to a
    if(($2 in a==0) || $1<a[$2])                  # avoid collisions
        a[$2]=$1
    next
}
{
    for(i in a) {                                 # in desc order
        # print "DEBUG: For:",$0 ":", a[i], i     # remove # for debug
        if(i+0>=$1) {                             # upper bound reaches this range
            if(a[i]<=$2) {                        # lower bound does too: overlap
                print
                next
            }
        }
        else
            next
    }
}

Test data:

$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7 
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4 
8 13 
20 24

Output:

$ awk -f program.awk file1 file2
1 2
3 4 
8 13

and

$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15

Answer 5:

If a Perl solution is preferred, the following one-liner works on the sample data:

/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>

Answer 6:

If the ranges are sorted by their lower bounds, we can exploit this to make the algorithm more efficient. The idea is to advance through the ranges in file1 and file2 alternately. More precisely, for a given range R in file2, we read further ranges from file1 until we know whether any of them overlaps R. Once we know this, we move on to the next range in file2.

#!/bin/bash

exec 3< "$1"  # file whose ranges are checked for overlap with those ...
exec 4< "$2"  # ... from this file, and if so, are written to stdout

l4=-1  # lower bound of the current range from "$2"
u4=-1  # upper bound of the current range from "$2"
# both start at -1 so that a range is read from "$2" on the first iteration

echo "Ranges in $1 that intersect any ranges in $2:"
while read -r l3 u3; do  # read the next range from "$1"
  if (( u4 >= l3 )); then
    (( l4 <= u3 )) && echo "$l3 $u3"
  else  # the upper bound from "$2" is below the lower bound from "$1", so ...
    while read -r l4 u4; do  # ... we read further ranges from "$2" until ...
      if (( u4 >= l3 )); then  # ... their upper bound is high enough
        (( l4 <= u3 )) && echo "$l3 $u3"
        break
      fi
    done <&4
  fi
done <&3

The script can be called with ./script.sh file2 file1 to print the ranges in file2 that overlap a range in file1.
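
The script relies on both files being sorted by lower bound. If they are not, they can be sorted first. A minimal sketch, where sort_ranges is a name invented here and the input is assumed to be two whitespace-separated integer columns:

```shell
#!/bin/bash
# Hypothetical helper: print a two-column range list sorted numerically
# by lower bound (column 1), with ties broken by upper bound (column 2).
# Reads the files given as arguments, or stdin if there are none.
sort_ranges() {
    sort -k1,1n -k2,2n -- "$@"
}

# Example with unsorted input on stdin.
printf '8 13\n20 24\n3 4\n' | sort_ranges
```

With bash process substitution the sorted data can be passed to the script directly, e.g. ./script.sh <(sort_ranges file2) <(sort_ranges file1), assuming the script is saved as script.sh.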

Source: https://stackoverflow.com/questions/46033946/how-to-compare-2-lists-of-ranges-in-bash

Tags

Linux

bash

awk

range

genetics