Question
I'm comparing two files. The first has an identifier column, a start value, and an end value. The second contains corresponding identifiers and a single value column.
Example:
File 1:
A 200 900
A 1000 1200
B 100 700
B 900 1000
File 2:
A 103
A 200
A 250
B 50
B 100
B 150
I would like to find all values from the second file that are contained within the ranges found in the first file so that my output would look like:
A 200
A 250
B 100
B 150
For now I have created a dictionary from the first file with a list of ranges, for example:
if Identifier in Dictionary:
    Dictionary[Identifier].extend(range(Start, (End+1)))
else:
    Dictionary[Identifier] = range(Start, (End+1))
I then go through the second file and search for the value within the dictionary of ranges, for example:
if Identifier in Dictionary:
    if Value in Dictionary[Identifier]:
        OutFile.write(Line + "\n")
While not optimal, this works for relatively small files. However, I have several large files and this program is proving terribly inefficient. I need to optimize it so that it runs much faster.
Answer 1:
from collections import defaultdict

# Map each identifier to a list of (start, end) tuples
ident_ranges = defaultdict(list)
with open('file1.txt', 'r') as f1:
    for row in f1:
        ident, start, end = row.split()
        start, end = int(start), int(end)
        ident_ranges[ident].append((start, end))

with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:
    for line in f2:
        ident, value = line.split()
        value = int(value)
        # Keep the line if the value falls inside any range for this identifier
        if any(start <= value <= end for start, end in ident_ranges[ident]):
            output.write(line)
Notes: Using a defaultdict allows you to add ranges to your dictionary without first checking for the existence of a key. Using any allows for short-circuiting of the range check. Using chained comparison (start <= value <= end) is a nice Python syntactic shortcut.
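The same logic can be tried without any files. The following is a minimal sketch that assumes the rows have already been parsed into tuples; the function name filter_values and the in-memory lists are hypothetical and only mirror the sample data from the question.

```python
from collections import defaultdict

def filter_values(range_rows, value_rows):
    """Return the (identifier, value) pairs whose value lies in any range
    recorded for that identifier."""
    ident_ranges = defaultdict(list)
    for ident, start, end in range_rows:
        ident_ranges[ident].append((start, end))
    # .get() avoids creating empty entries for identifiers missing from file 1
    return [
        (ident, value)
        for ident, value in value_rows
        if any(start <= value <= end
               for start, end in ident_ranges.get(ident, ()))
    ]

# Sample data from the question, already split and converted to int
file1_rows = [("A", 200, 900), ("A", 1000, 1200), ("B", 100, 700), ("B", 900, 1000)]
file2_rows = [("A", 103), ("A", 200), ("A", 250), ("B", 50), ("B", 100), ("B", 150)]

matches = filter_values(file1_rows, file2_rows)
for ident, value in matches:
    print(ident, value)
```

With the question's sample data this should print the four lines the asker expects (A 200, A 250, B 100, B 150).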
Answer 2:
Do you need to construct range(START, END)? That seems quite wasteful when you can do:
if START <= x <= END:
    # process
Checking if the value is in the range is slow because a) you've had to construct the list and b) you then have to perform a linear search over the list to find the value.
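To make the difference concrete, here is a small sketch comparing the two checks on one range from the question. The helper name in_bounds and the probe values are made up for illustration; both checks give the same answers, but the list version has to store every integer in the range first.

```python
start, end = 200, 900

# What the question builds: every integer in the range, stored in a list
as_list = list(range(start, end + 1))

def in_bounds(value, start, end):
    """Two comparisons, regardless of how wide the range is."""
    return start <= value <= end

probes = [103, 200, 250, 900, 901]
list_results = [v in as_list for v in probes]            # linear search each time
bounds_results = [in_bounds(v, start, end) for v in probes]

print(len(as_list))                    # number of integers the list has to hold
print(list_results == bounds_results)  # both approaches agree
```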
Answer 3:
You can try something like this:
In [27]: ranges=defaultdict(list)
In [28]: with open("file1") as f:
    for line in f:
        name,st,end=line.split()
        st,end=int(st),int(end)
        ranges[name].append([st,end])
   ....:
In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})
In [29]: with open("file2") as f:
    for line in f:
        name,val=line.split()
        val=int(val)
        if any(y[0]<=val<=y[1] for y in ranges[name]):
            print name,val
   ....:
A 200
A 250
B 100
B 150
Answer 4:
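Neat trick: Python lets you do in comparisons with xrange objects, which is much more memory efficient than doing in with a range list, because the numbers are never stored. (In Python 2 the in test on an xrange still walks through the values one by one; in Python 3, range replaces xrange and tests integer membership in constant time.)
So, you can do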
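from collections import defaultdict
rangedict = defaultdict(list)
...
rangedict[ident].append(xrange(start, end+1))
...
# Only check the ranges stored for this line's identifier
for r in rangedict[ident]:
    if v in r:
        print >>outfile, line
        break  # one match is enough; avoids writing the line twice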
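For readers on Python 3, where xrange no longer exists, a rough equivalent might look like the sketch below. It assumes CPython 3, where a range object is lazy and checks integer membership arithmetically rather than by scanning; the helper name contained and the sample rows are illustrative only.

```python
from collections import defaultdict

rangedict = defaultdict(list)
for ident, start, end in [("A", 200, 900), ("A", 1000, 1200)]:
    # range() here is lazy: no list of integers is ever built
    rangedict[ident].append(range(start, end + 1))

def contained(ident, value):
    """True if value falls in any range stored for ident."""
    # .get() keeps unknown identifiers from being added to the defaultdict
    return any(value in r for r in rangedict.get(ident, ()))

print(contained("A", 250))   # inside the first range
print(contained("A", 950))   # in the gap between the two ranges
print(contained("B", 250))   # identifier with no ranges
```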
Answer 5:
Since you've got large ranges and your problem is essentially just a bunch of comparisons, it's almost certainly faster to store a start/end tuple than the whole range (especially since what you have now is going to duplicate most of the numbers in the ranges if two happen to overlap).
# Building the dict
if ident not in d:
    d[ident] = (lo, hi)
else:
    old_lo, old_hi = d[ident]
    d[ident] = (min(lo, old_lo), max(hi, old_hi))
Then your comparisons just look like:
# comparing...
if ident in d:
    if d[ident][0] <= val <= d[ident][1]:
        outfile.write(line+'\n')
Both parts of this will go faster if you aren't making separate checks for if ident in d. Python dictionaries are nice and fast, so just make the call to it in the first place. You've got the ability to provide defaults to the dictionary, so use it. I haven't benchmarked this to measure the speedup, but you'd certainly get some, and it certainly works:
# These both make use of the following somewhat silly hack:
# In Python 2, None is treated as less than everything (even -float('inf'))
# and empty containers (e.g. (), [], {}) are treated as greater than any number.
# So we use the tuple ((), None) as if it was (float('inf'), float('-inf'))
# (Python 3 raises TypeError for these mixed-type comparisons.)
for line in file1:
    ident, lo, hi = line.split()
    lo = int(lo)
    hi = int(hi)
    old_lo, old_hi = d.get(ident, ((), None))
    d[ident] = (min(lo, old_lo), max(hi, old_hi))
# comparing:
for line in file2:
    ident, val = line.split()
    val = int(val)
    lo, hi = d.get(ident, ((), None))
    if lo <= val <= hi:
        outfile.write(line)  # unless you stripped it off, this still has a \n
The above code is what I was using to test; it runs on a file2 of a million lines in a couple of seconds.
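One thing to keep in mind: a single (lo, hi) pair per identifier also covers any gap between that identifier's ranges (for identifier A in the question, the values from 901 to 999). If the gaps matter, one alternative, which is not part of this answer, is to merge only overlapping or touching ranges and then binary-search the sorted result. The sketch below is written for Python 3 under that assumption; build_index, contains, and the extra ("A", 850, 950) row are hypothetical and exist only to show an overlap being merged.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(range_rows):
    """Group ranges by identifier, then merge overlapping or touching ones."""
    grouped = defaultdict(list)
    for ident, lo, hi in range_rows:
        grouped[ident].append((lo, hi))
    index = {}
    for ident, spans in grouped.items():
        merged = []
        for lo, hi in sorted(spans):
            if merged and lo <= merged[-1][1] + 1:
                # Overlaps or directly follows the previous span: extend it
                merged[-1] = (merged[-1][0], max(hi, merged[-1][1]))
            else:
                merged.append((lo, hi))
        starts = [lo for lo, _ in merged]
        index[ident] = (starts, merged)
    return index

def contains(index, ident, val):
    """Binary search for the last span starting at or before val."""
    if ident not in index:
        return False
    starts, merged = index[ident]
    pos = bisect_right(starts, val) - 1
    return pos >= 0 and val <= merged[pos][1]

# Question's ranges for A and B, plus one made-up overlapping row for A
index = build_index([("A", 200, 900), ("A", 1000, 1200),
                     ("A", 850, 950), ("B", 100, 700)])
print(index["A"][1])              # merged spans for A
print(contains(index, "A", 975))  # falls in the gap between merged spans
```

Each lookup then costs a binary search over the merged spans for one identifier rather than a scan over all of them, which may help when an identifier has many ranges.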
Source: https://stackoverflow.com/questions/15392623/finding-a-value-within-a-dictionary-of-ranges-python