Find all the numbers in one file that are not in another file in python

南笙酒味 提交于 2019-12-18 18:50:05

问题


There are two files, say FileA and FileB and we need to find all the numbers that are in FileA which is not there in FileB. All the numbers in the FileA are sorted and all the numbers in FileB are sorted. For example,

Input:

FileA = [1, 2, 3, 4, 5, ...]
FileB = [1, 3, 4, 6, ...]

Output:

[2, 5, ...]

The memory is very limited and even one entire file cannot be loaded into memory at a time. Also linear or lesser time complexity is needed.

So if the files are small enough to fit in the memory, we could load them and initialize its contents as two sets and then take a set difference so that the problem is solved in O(1) or constant time complexity.

set(contentsofFileA)-set(contentsofFileB)

But since the files are so big, they won't be able to load entirely into the memory and so this is not possible.

Also, another approach would be to use a brute force method with batch processing. So, we load a chunk or batch of data from FileA and then a batch from FileB and then compare it and then the next chunk from FileB and so on. Then after the FileA chunk is checked over all the elements in FileB then load the next batch from FileA and this continues. But this would create an O(n^2) or quadratic time complexity and not efficient for a very large file with large entries.

The problem is required to be solved in linear or lesser time complexity and without loading the entire files into memory. Any help?


回答1:


If you want to read the files line by line since you don't have so much memory and you need a linear solution you can do this with iter if your files are line based, otherwise see this:

First in your terminal you can do this to generate some test files:

seq 0 3 100 > 3k.txt
seq 0 2 100 > 2k.txt

Then you run this code:

i1 = iter(open("3k.txt"))
i2 = iter(open("2k.txt"))
a = int(next(i1))
b = int(next(i2))
aNotB = []
# bNotA = []
while True:
    try:
        if a < b:
            aNotB += [a]
            a = int(next(i1, None))
        elif a > b:
            # bNotA += [a]
            b = int(next(i2, None))
        elif a == b:
            a = int(next(i1, None))
            b = int(next(i2, None))
    except TypeError:
        if not b:
            aNotB += list(i1)
            break
        else:
            # bNotA += list(i1)
            break
print(aNotB)

Output:

[3, 9, 15, 21, 27, 33, 39, 45, 51, 57, 63, 69, 75, 81, 87, 93, 99] If you want both the result for aNotB and bNotA you can uncomment those two lines.

Timing comparison with Andrej Kesely's answer:

$ seq 0 3 1000000 > 3k.txt
$ seq 0 2 1000000 > 2k.txt
$ time python manual_iter.py        
python manual_iter.py  0.38s user 0.00s system 99% cpu 0.387 total
$ time python heapq_groupby.py        
python heapq_groupby.py  1.11s user 0.00s system 99% cpu 1.116 total



回答2:


As files are sorted you can just iterate through each line at a time, if the line of file A is less than the line of file B then you know that A is not in B so you then increment file A only and then check again. If the line in A is greater than the line in B then you know that B is not in A so you increment file B only. If A and B are equal then you know line is in both so increment both files. while in your original question you stated you were interested in entries which are in A but not B, this answer will extend that and also give entries in B not A. This extends the flexability but still allows you so print just those in A not B.

def strip_read(file):
    return file.readline().rstrip()

in_a_not_b = []
in_b_not_a = []
with open("fileA") as A:
    with open("fileB") as B:
        Aline = strip_read(A)
        Bline = strip_read(B)
        while Aline or Bline:
            if Aline < Bline and Aline:
                in_a_not_b.append(Aline)
                Aline = strip_read(A)
            elif Aline > Bline and Bline:
                in_b_not_a.append(Bline)
                Bline = strip_read(B)
            else:
                Aline = strip_read(A)
                Bline = strip_read(B)

print("in A not in B", in_a_not_b, "\nin B not in A", in_b_not_a)

OUTPUT for my sample Files

in A not in B ['2', '5', '7'] 
in B not in A ['6']



回答3:


You can combine itertools.groupby (doc) and heapq.merge (doc) to iterate through FileA and FileB lazily (it works as long the files are sorted!)

FileA = [1, 1, 2, 3, 4, 5]
FileB = [1, 3, 4, 6]

from itertools import groupby
from heapq import merge

gen_a = ((v, 'FileA') for v in FileA)
gen_b = ((v, 'FileB') for v in FileB)

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: int(k[0])), lambda k: int(k[0])):
    if any(v[1] == 'FileB' for v in g):
        continue
    print(v)

Prints:

2
5

EDIT (Reading from files):

from itertools import groupby
from heapq import merge

gen_a = ((int(v.strip()), 1) for v in open('3k.txt'))
gen_b = ((int(v.strip()), 2) for v in open('2k.txt'))

for v, g in groupby(merge(gen_a, gen_b, key=lambda k: k[0]), lambda k: k[0]):
    if any(v[1] == 2 for v in g):
        continue
    print(v)

Benchmark:

Generating files with 10_000_000 items:

seq 0 3 10000000 > 3k.txt
seq 0 2 10000000 > 2k.txt

The script takes ~10sec to complete:

real    0m10,656s
user    0m10,557s
sys 0m0,076s



回答4:


A simple solution based on file reading (asuming that each line hold a number):

results = []
with open('file1.csv') as file1, open('file2.csv') as file2:
        var1 = file1.readline()
        var2 = file2.readline()
        while var1:
            while var1 and var2:
                if int(var1) < int(var2):
                    results.append(int(var1))
                    var1 = file1.readline()
                elif int(var1) > int(var2):
                    var2 = file2.readline()
                elif int(var1) == int(var2):
                    var1 = file1.readline()
                    var2 = file2.readline()
            if var1:
                results.append(int(var1))
                var1 = file1.readline()
print(results)
output = [2, 5, 7, 9]



回答5:


This is similar to the classic Knuth Sorting and Searching. You may wish to consider reading stack question, on-line lecture notes pdf, and Wikipedia. The stack question mentions something that I agree with, which is using unix sort command. Always, always test with your own data to ensure the method chosen is the most efficient for your data because some of these algorithms are data dependant.



来源:https://stackoverflow.com/questions/57287400/find-all-the-numbers-in-one-file-that-are-not-in-another-file-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!