问题
I'm comparing 2 files with an initial identifier column, start value, and end value. The second file contains corresponding identifiers and another value column.
Ex.
File 1:
A 200 900
A 1000 1200
B 100 700
B 900 1000
File 2:
A 103
A 200
A 250
B 50
B 100
B 150
I would like to find all values from the second file that are contained within the ranges found in the first file so that my output would look like:
A 200
A 250
B 100
B 150
For now I have created a dictionary from the first file with a list of ranges: Ex.
if Identifier in Dictionary:
Dictionary[Identifier].extend(range(Start, (End+1)))
else:
Dictionary[Identifier] = range(Start, (End+1))
I then go through the second file and search for the value within the dictionary of ranges: Ex.
if Identifier in Dictionary:
if Value in Dictionary[Identifier]:
OutFile.write(Line + "\n")
While not optimal this works for relatively small files, however I have several large files and this program is proving terribly inefficient. I need to optimize my program so that it will run much faster.
回答1:
from collections import defaultdict
ident_ranges = defaultdict(list)
with open('file1.txt', 'r') as f1
for row in f1:
ident, start, end = row.split()
start, end = int(start), int(end)
ident_ranges[ident].append((start, end))
with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:
for line in f2:
ident, value = line.split()
value = int(value)
if any(start <= value <= end for start, end in ident_ranges[ident]):
output.write(line)
Notes: Using a defaultdict allows you to add ranges to your dictionary without first checking for the existence of a key. Using any allows for short circuiting of the range check. Using chained comparision is a nice Python syntactic shortcut (start <= value <= end).
回答2:
Do you need to construct range(START, END)? That seems quite wasteful when you can do:
if START <= x <= END:
# process
Checking if the value is in the range is slow because a) you've had to construct the list and b) perform a linear search over the list to find it.
回答3:
You can try something like this:
In [27]: ranges=defaultdict(list)
In [28]: with open("file1") as f:
for line in f:
name,st,end=line.split()
st,end=int(st),int(end)
ranges[name].append([st,end])
....:
In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})
In [29]: with open("file2") as f:
for line in f:
name,val=line.split()
val=int(val)
if any(y[0]<=val<=y[1] for y in ranges[name]):
print name,val
....:
A 200
A 250
B 100
B 150
回答4:
Neat trick: Python lets you do in comparisons with xrange objects, which is much faster than doing in with a range, and much more memory efficient.
So, you can do
from collections import defaultdict
rangedict = defaultdict(list)
...
rangedict[ident].append(xrange(start, end+1))
...
for i in rangedict:
for r in rangedict[i]:
if v in r:
print >>outfile, line
回答5:
Since you've got large ranges and your problem is essentially just a bunch of comparisons, it's almost certainly faster to store a start/end tuple than the whole range (especially since what you have now is going to duplicate most of the numbers in the ranges if two happen to overlap).
# Building the dict
if not ident in d:
d[ident] = (lo, hi)
else:
old_lo, old_hi = d[ident]
d[ident] = (min(lo, old_lo), max(hi, old_hi))
Then your comparisons just look like:
# comparing...
if ident in d:
if d[ident][0] <= val <= d[ident][1]:
outfile.write(line+'\n')
Both parts of this will go faster if you aren't making separate checks for if ident in d. Python dictionaries are nice and fast, so just make the call to it in the first place. You've got the ability to provide defaults to the dictionary, so use it. I haven't benchmarked this or anything to see what the speedup is, but you'd certainly get some, and it certainly works:
# These both make use of the following somewhat silly hack:
# In Python, None is treated as less than everything (even -float('inf))
# and empty containers (e.g. (), [], {}) are treated as greater than everything.
# So we use the tuple ((), None) as if it was (float('inf'), float('-inf))
for line in file1:
ident, lo, hi = line.split()
lo = int(lo)
hi = int(hi)
old_lo, old_hi = d.get(ident, ((), None))
d[ident] = (min(lo, old_lo), max(hi, old_hi))
# comparing:
for line in file2:
ident, val = line.split()
val = int(val)
lo, hi = d.get(ident, ((), None))
if lo <= val <= hi:
outfile.write(line) # unless you stripped it off, this still has a \n
The above code is what I was using to test; it runs on a file2 of a million lines in a couple seconds.
来源:https://stackoverflow.com/questions/15392623/finding-a-value-within-a-dictionary-of-ranges-python