Compare files line by line to see if they are the same, if so output them

Question

How would I go about this? I have files in which I have sorted the information, and I want to compare a certain index in one file with an index in another. One problem is that the files are enormous: millions of lines. I want to compare the files line by line and, if the values match, write out both of them along with other values, using an index method.

=======================

Let me clarify. I want to take, say, line[x]; the x will stay the same because the file is formatted uniformly. I want to run line[x] against line[y] in another file, do this for the whole file, and output every matching pair to another file. In that output file I also want to include other pieces from the first file, which would just mean adding more indexes, such as line[a], line[b], line[c], line[d], and finally line[y] as the match to that information.

Try 3:

I have a file with information in this format:

#x is a line

 x= data,data,data,data,data,data

There are millions of lines like that.

I have another file, same format:

    #x is a line
    x= data,data,data,data

I want to use x[#] from the first file and x[#] from the second file and see whether those two values match. If they do, I want to output them, along with several other x[#] values from the second file that are on the same line.

Did that help at all to understand? The files are formatted as I said (but there are millions of lines, and I want to find the pairs in the two files because they should all match up):

  line 1  data,data,data,data
  line 2  data,data,data,data

data from file 1:

 (N'068D556A1A665123A6DD2073A36C1CAF', N'A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67',    
N'D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881');

data from file 2:

00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux

The files are sorted alphanumerically, and I want to use a slider method. By that I mean: if file1[x] < file2[x], move the slider down or up depending on which value is greater, until a match is found. When a match is found, print the output along with other values that will identify that hash.

What I want as a result would be:

file1[x] and its corresponding match in file2[x] written to a file, as well as other file1[x] where x can be any index from the line.

Answer 1:

What I got from the clarification:

file1 and file2 are in the same format, where each line looks like
```
{32 char hex key}|{text1}|{text2}|{text3}
```
The files are sorted in ascending order by key.
For each key that appears in both file1 and file2, you want merged output, so each line looks like
```
{32 char hex key}|{text11}|{text12}|{text13}|{text21}|{text22}|{text23}
```

You basically want the collisions from a merge sort:

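import csv

def getnext(reader, key=lambda row: int(row[0], 16)):
    # Read the next row and return it with its hex key parsed as an integer.
    row = next(reader)  # raises StopIteration at the end of the file
    return key(row), row

# newline='' is what the csv module expects for files it reads or writes.
with open('file1.dat', newline='') as inf1, \
     open('file2.dat', newline='') as inf2, \
     open('merged.dat', 'w', newline='') as outf:
    a = csv.reader(inf1, delimiter='|')
    b = csv.reader(inf2, delimiter='|')
    res = csv.writer(outf, delimiter='|')

    try:
        a_key, a_row = getnext(a)
        b_key, b_row = getnext(b)
        while True:
            if a_key < b_key:
                a_key, a_row = getnext(a)   # file1 is behind, advance it
            elif b_key < a_key:
                b_key, b_row = getnext(b)   # file2 is behind, advance it
            else:
                res.writerow(a_row + b_row[1:])
                # Advance past the match; otherwise the same pair is written forever.
                a_key, a_row = getnext(a)
    except StopIteration:
        # reached the end of an input file
        pass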

I still have no idea what you are trying to communicate by 'as well as other file1[x] where x can be any index from the line'.
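One possible reading of that phrase is that the output should carry the matched key plus a chosen subset of fields from the matching lines. A minimal sketch under that assumption follows. The names `merge_matches`, `key_a`, `key_b`, and `pick` are hypothetical and not part of the code above, the sample rows are made up, and both inputs are assumed to be sorted in ascending order by key.

```python
def merge_matches(rows_a, rows_b, key_a, key_b, pick):
    """Yield pick(row_a, row_b) for every pair of rows whose keys are equal.

    Assumes both iterables are sorted ascending by their key.
    """
    it_a, it_b = iter(rows_a), iter(rows_b)
    try:
        row_a, row_b = next(it_a), next(it_b)
        while True:
            ka, kb = key_a(row_a), key_b(row_b)
            if ka < kb:
                row_a = next(it_a)   # slide down the first input
            elif kb < ka:
                row_b = next(it_b)   # slide down the second input
            else:
                yield pick(row_a, row_b)
                row_a = next(it_a)   # keep row_b in case the key repeats in rows_a
    except StopIteration:
        return  # one input is exhausted, so no further matches are possible


# Made-up sample rows: the key is field 0, parsed as hexadecimal.
file1 = [["0a", "x1", "y1"], ["0c", "x2", "y2"], ["0f", "x3", "y3"]]
file2 = [["0b", "p"], ["0c", "q"], ["0f", "r"]]

matches = list(merge_matches(
    file1, file2,
    key_a=lambda r: int(r[0], 16),
    key_b=lambda r: int(r[0], 16),
    pick=lambda a, b: [a[0], a[2], b[1]],   # key, one field from file1, one from file2
))
print(matches)
```

Because the function only needs iterables, the `csv.reader` objects can be passed in directly, and `pick` decides which indexes end up in the output.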

Answer 2:

Comparing the files line by line with zip means neither file has to be held in memory, which matters because the files are huge (in Python 3 zip is lazy; in Python 2, itertools.izip behaves the same way).

with open('file1.txt') as f1, open('file2.txt') as f2, open('file3.txt','w') as f3:
    for x, y in zip(f1, f2): 
        if x == y:
            f3.write(x)
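
Note that `zip` pairs line 1 with line 1, line 2 with line 2, and so on, and stops at the end of the shorter file, so a value that appears in both files at different positions is not reported. The sketch below shows that behaviour on two small in-memory files; `io.StringIO` stands in for the real files, and `same_position_matches` is a hypothetical helper name.

```python
import io


def same_position_matches(f1, f2):
    """Yield the lines that are identical at the same position in both files."""
    for x, y in zip(f1, f2):
        if x == y:
            yield x


# Made-up sample data; f2 is one line shorter than f1.
f1 = io.StringIO("a\nb\nc\nd\n")
f2 = io.StringIO("a\nx\nc\n")

result = list(same_position_matches(f1, f2))
print(result)
```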

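Answer 3:

Comparing the contents of two files at a specified index (a byte offset, since that is what seek expects):

index = 0  # byte offset to start reading from (seek counts bytes, not lines)

with open("file1.txt", "r") as fp1, open("file2.txt", "r") as fp2:
    # Jump to the same offset in both files.
    fp1.seek(index)
    fp2.seek(index)

    # Read from that offset to the end of the current line.
    line1 = fp1.readline()
    line2 = fp2.readline()

    if line1 == line2:
        print(line1)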
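
If "index" is meant as a line number rather than a byte offset, one option is `itertools.islice`, which skips lines without reading the whole file into memory. A sketch under that assumption; `nth_line` is a hypothetical helper, and `io.StringIO` stands in for the real files.

```python
import io
from itertools import islice


def nth_line(f, n):
    """Return line number n (0-based) from an open file, or None if the file is shorter."""
    return next(islice(f, n, n + 1), None)


# Made-up sample data: the files differ only on line 1.
f1 = io.StringIO("alpha\nbeta\ngamma\n")
f2 = io.StringIO("alpha\nBETA\ngamma\n")

same_at_2 = nth_line(f1, 2) == nth_line(f2, 2)
print(same_at_2)
```

The helper consumes the file up to and including the requested line, so it is meant to be called once per freshly opened file.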

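Comparing two files line by line and printing every pair of lines that does not match:

with open("file1.txt", "r") as fp1, open("file2.txt", "r") as fp2:
    line1, line2 = fp1.readline(), fp2.readline()

    # readline() returns an empty string at end of file, which ends the loop.
    while line1 and line2:
        if line1 != line2:
            print("Mismatch.\n1: %s\n2: %s" % (line1, line2))
        # Advance both files; without this the loop never terminates.
        line1, line2 = fp1.readline(), fp2.readline()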
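
If the files can differ in length, `itertools.zip_longest` keeps going after the shorter file ends, so trailing lines are reported as mismatches too. A sketch under that assumption; `mismatches` is a hypothetical helper, and the in-memory files are made-up sample data.

```python
import io
from itertools import zip_longest


def mismatches(f1, f2):
    """Yield (line_number, line1, line2) for every position where the files differ.

    A missing line (one file is shorter) is reported as None.
    """
    for lineno, (a, b) in enumerate(zip_longest(f1, f2), start=1):
        if a != b:
            yield lineno, a, b


f1 = io.StringIO("a\nb\nc\n")
f2 = io.StringIO("a\nx\n")

diffs = list(mismatches(f1, f2))
print(diffs)
```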

Source: https://stackoverflow.com/questions/11253667/compare-files-line-by-line-to-see-if-they-are-the-same-if-so-output-them

Tags

python

slider

analytics

match

string-matching