How to compare 2 files having random numbers in non sequential order?

Question

There are 2 files, named compare1.txt and compare2.txt, containing random numbers in non-sequential order.

cat compare1.txt

cat compare2.txt

Aim

Output a list of all the numbers that are present in compare1.txt but not in compare2.txt, and vice versa.
If a number has leading zeros, ignore them when comparing (that is, two entries count as a mismatch only if their numeric values differ). For example, 3 should match 003, 014 should match 14, 008 should match 8, and so on.

Note: matches do not have to occur on the same line. A number on the first line of compare1.txt should count as matched even if the same number appears on a different line of compare2.txt.

Expected output

PS (I don't need this exact order in the expected output; these 4 numbers in any order would do.)

What I tried?

I did not expect to get the second condition right, so I tried to satisfy only the first one, but I still could not get correct results. I tried these commands:

grep -Fxv -f compare1.txt compare2.txt && grep -Fxv -f compare2.txt compare1.txt

cat compare1.txt compare2.txt | sort |uniq

Edit - A Python solution is also fine

Answer 1:

Could you please try the following, written and tested with the shown samples in GNU awk.

awk '
{
  $0=$0+0
}
FNR==NR{
  a[$0]
  next
}
($0 in a){
  b[$0]
  next
}
{ print }
END{
  for(j in a){
    if(!(j in b)){ print j }
  }
}
'  compare1.txt compare2.txt

Explanation: here is the same program with a detailed comment on each line.

awk '                                ##Starting awk program from here.
{
  $0=$0+0                            ##Adding 0 will remove extra zeros from current line,considering that your file doesn't have float values.
}
FNR==NR{                             ##Checking condition FNR==NR which will be TRUE when 1st Input_file is being read.
  a[$0]                              ##Creating array a with index of current line here.
  next                               ##next will skip all further statements from here.
}
($0 in a){                           ##Checking condition if current line is present in a then do following.
  b[$0]                              ##Creating array b with index of current line.
  next                               ##next will skip all further statements from here.
}
{ print }                            ##Printing current line from 2nd Input_file, since it was not found in a.
END{                                 ##Starting END block of this code from here.
  for(j in a){                       ##Traversing through array a here.
    if(!(j in b)){ print j }         ##Checking condition if current index value is NOT present in b then print that index.
  }
}
'  compare1.txt compare2.txt         ##Mentioning Input_file names here.
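
For readers more at home in Python, the same two-pass logic can be sketched roughly as follows. This is not part of the awk answer: the function name `awk_style_diff` is made up for illustration, and the sample lists are assumed to match the arrays quoted in the Ruby answer further down. Unlike awk, `int()` raises an error on blank lines rather than treating them as 0.

```python
def awk_style_diff(lines1, lines2):
    """Mimic the awk program: normalise, remember file 1, stream file 2."""
    seen_in_1 = {}      # plays the role of awk array a (dicts keep insertion order)
    matched = set()     # plays the role of awk array b
    out = []

    for line in lines1:
        seen_in_1[int(line)] = True   # int() drops leading zeros, like $0+0

    for line in lines2:
        n = int(line)
        if n in seen_in_1:
            matched.add(n)            # present in both files
        else:
            out.append(n)             # the awk version prints this immediately

    # Equivalent of the END block: numbers from file 1 never seen in file 2.
    out.extend(n for n in seen_in_1 if n not in matched)
    return out


# Hypothetical sample data standing in for the two files.
sample1 = ["57", "11", "13", "3", "889", "014", "91"]
sample2 = ["003", "889", "13", "14", "57", "12", "90"]
print(awk_style_diff(sample1, sample2))   # [12, 90, 11, 91]
```

The order of the last two values is deterministic here only because Python dicts preserve insertion order; awk's `for (j in a)` makes no such promise.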

Answer 2:

Here's how to do what you want just using awk:

$ awk '{$0+=0} NR==FNR{a[$0];next} !($0 in a)' compare1.txt compare2.txt
12
90

$ awk '{$0+=0} NR==FNR{a[$0];next} !($0 in a)' compare2.txt compare1.txt
11
91

But this is exactly the job that comm exists to do, so here is how you could use it to get all differences and common lines at once. In the following output, column 1 holds numbers found only in compare1.txt, column 2 holds numbers found only in compare2.txt, and column 3 holds numbers common to both files:

$ comm <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
11
    12
        13
        14
        3
        57
        889
    90
91

or to get each result individually:

$ comm -23 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
11
91

$ comm -13 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
12
90

$ comm -12 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
13
14
3
57
889
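
As a rough Python counterpart to the three comm columns, the sketch below splits two lists of lines into "only in first", "only in second" and "common". The function name `three_way` and the in-memory sample data are assumptions for illustration. Note one difference: comm needs lexically sorted input, whereas this sketch sorts its results numerically.

```python
def three_way(lines1, lines2):
    """Return (only in first, only in second, common) as sorted lists of ints."""
    s1 = {int(x) for x in lines1 if x.strip()}   # skip blank lines
    s2 = {int(x) for x in lines2 if x.strip()}
    return sorted(s1 - s2), sorted(s2 - s1), sorted(s1 & s2)


# Hypothetical sample data standing in for compare1.txt and compare2.txt.
sample1 = ["57", "11", "13", "3", "889", "014", "91"]
sample2 = ["003", "889", "13", "14", "57", "12", "90"]

only1, only2, common = three_way(sample1, sample2)
print(only1)    # [11, 91]                 like comm -23
print(only2)    # [12, 90]                 like comm -13
print(common)   # [3, 13, 14, 57, 889]     like comm -12
```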

Answer 3:

Given those two files, in Python, you can use a symmetric difference of sets:

f1, f2 = "compare1.txt", "compare2.txt"   # paths to the two input files

with open(f1) as f:         # read the first file into a set
    s1 = {int(e) for e in f}

with open(f2) as f:         # read the second file into a set
    s2 = {int(e) for e in f}

print(s2 ^ s1)              # symmetric difference of those two sets
# {11, 12, 90, 91}

Which can be further simplified to:

with open(f1) as fh1, open(f2) as fh2:   # separate names so the paths are not shadowed
    print({int(e) for e in fh1} ^ {int(e) for e in fh2})
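
One caveat with the snippets above is that `int()` raises `ValueError` on an empty line, for example a trailing blank line at the end of a file. A slightly more defensive sketch, assuming blank lines should simply be skipped, might look like this. The helper names `read_ints` and `mismatches` and the temporary demo files are assumptions, not part of the original answer.

```python
import os
import tempfile


def read_ints(path):
    """Read one integer per line, ignoring blank or whitespace-only lines."""
    with open(path) as fh:
        return {int(line) for line in fh if line.strip()}


def mismatches(path1, path2):
    """Numbers that appear in exactly one of the two files."""
    return read_ints(path1) ^ read_ints(path2)


# Demo with throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    p1 = os.path.join(tmp, "compare1.txt")
    p2 = os.path.join(tmp, "compare2.txt")
    with open(p1, "w") as fh:
        fh.write("57\n11\n13\n3\n889\n014\n91\n\n")   # note the trailing blank line
    with open(p2, "w") as fh:
        fh.write("003\n889\n13\n14\n57\n12\n90\n")
    result = mismatches(p1, p2)

print(sorted(result))   # [11, 12, 90, 91]
```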

Answer 4:

I think I heard somewhere that a Ruby solution would be OK, so I will give two, but if Ruby is on the black list, at least one of the methods can be easily translated to a language on the approved list (no knowledge of Ruby required). The first method uses sets, which Ruby implements with hashes under the covers. The second method uses hashes. I've provided the latter should the language of choice not support set objects.

The main thing is to use a method that is close to O(n) in computational complexity, where n is the sum of the sizes of the two arrays. I say "close to" O(n) because the methods I suggest use hashes, directly or indirectly, and hash lookups are not quite O(1). The conventional approach to this problem, enumerating the second array for each element of the first, and vice-versa, has a computational complexity of O(n²).

We are given two arrays:

arr1 = ["57", "11", "13", "3", "889", "014", "91"] 
arr2 = ["003", "889", "13", "14", "57", "12", "90"]

Use sets

require 'set'

def not_in_other(a1, a2)
  st = a2.map(&:to_i).to_set
  a1.reject { |s| st.include?(s.to_i) }
end

not_in_other(arr1, arr2) + not_in_other(arr2, arr1)
  #=> ["11", "91", "12", "90"]

Note:

a = arr2.map(&:to_i)
  #=> [3, 889, 13, 14, 57, 12, 90] 
a.to_set
  #=> #<Set: {3, 889, 13, 14, 57, 12, 90}>

Use hashes

Step 1: Construct a hash for each array

def hashify(arr)
  arr.each_with_object({}) { |s,h| h[s.to_i] = s }
end

h1 = hashify(arr1)
  #=> {57=>"57", 11=>"11", 13=>"13", 3=>"3", 889=>"889",
  #    14=>"014", 91=>"91"} 
h2 = hashify(arr2)
  #=> {3=>"003", 889=>"889", 13=>"13", 14=>"14", 57=>"57",
  #    12=>"12", 90=>"90"}

The meanings of these hashes (whose keys are integers) should be self-evident.

Step 2: Determine which keys in each hash are not present in the other hash

keys1 = h1.keys
  #=> [57, 11, 13, 3, 889, 14, 91] 
keys2 = h2.keys
  #=> [3, 889, 13, 14, 57, 12, 90] 

keepers1 = keys1.reject { |k| h2.key?(k) }
  #=> [11, 91] 
keepers2 = keys2.reject { |k| h1.key?(k) }
  #=> [12, 90]

One could alternatively write:

keepers1 = keys1 - keys2
keepers2 = keys2 - keys1

I expect this would be O(n), but that would depend on the implementation.

Step 3: Obtain the values of h1 for keys keepers1 and of h2 for keys keepers2, and combine them

h1.values_at(*keepers1) + h2.values_at(*keepers2)
  #=> ["11", "91", "12", "90"]
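
For comparison, the three hash steps translate almost line for line into Python dicts. The sketch below is an illustrative translation, not something from the original answer. It reuses the Ruby names (`hashify`, `keepers1`, `keepers2`) purely for readability. The benefit of this variant over plain integer sets is that it reports each value with its original spelling, leading zeros included.

```python
def hashify(items):
    """Map each numeric value to its original string (the last spelling wins)."""
    return {int(s): s for s in items}


arr1 = ["57", "11", "13", "3", "889", "014", "91"]
arr2 = ["003", "889", "13", "14", "57", "12", "90"]

# Step 1: one dict per array, keyed by integer value.
h1, h2 = hashify(arr1), hashify(arr2)

# Step 2: keys present in one dict but not in the other.
keepers1 = [k for k in h1 if k not in h2]
keepers2 = [k for k in h2 if k not in h1]

# Step 3: look the original strings back up and combine them.
result = [h1[k] for k in keepers1] + [h2[k] for k in keepers2]
print(result)   # ['11', '91', '12', '90']
```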

Answer 5:

Using Python, you can do the following:

import csv

def func(file1, file2):
    set1 = read_file_as_set(file1)
    set2 = read_file_as_set(file2)

    union = set1.union(set2) #find union first
    intersection = set1.intersection(set2) #find intersection
    return union.difference(intersection)


def read_file_as_set(file):
    result = set()

    with open(file) as csv_file:
        file_reader = csv.reader(csv_file)

        for line in file_reader:
            result.add(int(line[0]))

    return result

if __name__ == '__main__':
    # print() call form, so this also runs on Python 3
    print(func("path/to/first/file.csv", "path/to/second/file.csv"))

I am essentially reading both files as separate sets and returning (file1_set union file2_set) - (file1_set intersection file2_set), which is the symmetric difference of the two sets.
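
As a quick sanity check, the "union minus intersection" expression gives the same result as Python's built-in symmetric difference operator. The two sets below are hypothetical stand-ins for the parsed file contents.

```python
# Hypothetical parsed contents of the two files.
set1 = {57, 11, 13, 3, 889, 14, 91}
set2 = {3, 889, 13, 14, 57, 12, 90}

# Long form, as in the answer: (A | B) - (A & B)
long_form = set1.union(set2).difference(set1.intersection(set2))

# Short form: built-in symmetric difference
short_form = set1 ^ set2

print(long_form == short_form)   # True
print(sorted(short_form))        # [11, 12, 90, 91]
```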

Answer 6:

Another alternative solution, from a friend of mine, in Python:

list1 = set()
list2 = set()
with open('compare1.txt','r') as file1:
    for line in file1:
        if line != '\n':
            list1.add(int(line))

with open('compare2.txt','r') as file2:
    for line in file2:
        if line != '\n':
            list2.add(int(line))

list3 = list1.symmetric_difference(list2)

for number in list3:
    print(number)

Answer 7:

Another solution in Python:

x = [int(x) for x in open("compare1.txt")]
y = [int(x) for x in open("compare2.txt")]
z = []

for i in x:
    if (i not in y):
        z.append(i)


for i in y:
    if (i not in x):
        z.append(i)

for i in z:
    print(i)
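
The `not in` tests above scan a list each time, so the running time grows with the product of the two file sizes. If the files are large, one possible variation is to keep the lists for ordering but do the membership tests against sets. This is a sketch under that assumption. The function name `ordered_mismatches` and the sample lists are made up for illustration.

```python
def ordered_mismatches(x, y):
    """Same output order as the nested-list version, using set lookups."""
    in_x, in_y = set(x), set(y)           # built once; average O(1) membership tests
    only_x = [i for i in x if i not in in_y]
    only_y = [i for i in y if i not in in_x]
    return only_x + only_y


# Hypothetical parsed contents of compare1.txt and compare2.txt.
x = [57, 11, 13, 3, 889, 14, 91]
y = [3, 889, 13, 14, 57, 12, 90]

for number in ordered_mismatches(x, y):
    print(number)   # prints 11, 91, 12, 90 on separate lines
```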

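Answer 8:

Leaving the leading zeros aside, your task can be solved just by using the diff command and filtering its output. This assumes both files list their numbers in the same sorted order, since diff compares lines in sequence.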

diff "$FIRST" "$SECOND" \
        | awk '$1~/[<>]/{print $2}' # Only added or removed lines

You can get rid of leading zeros with bc, and sort the result so that the line order no longer matters:

FIRST=${1:?first file should be specified}    # :? aborts with this message if $1 is missing
SECOND=${2:?second file should be specified}
normalize() {
    bc < "$1" | sort --numeric
}
diff <(normalize "$FIRST") <(normalize "$SECOND") \
        | awk '$1~/[<>]/{print $2}'

Note that the process substitution syntax <(command) is a bashism; for POSIX compliance you will need to use temporary files instead.

Source: https://stackoverflow.com/questions/62495381/how-to-compare-2-files-having-random-numbers-in-non-sequential-order

Tags

python

awk