Question
I want to find the total number of non-overlapping matches of a pattern in a sequence, with a gap constraint of 2.
E.g. 2982f 2982l 2981l is a pattern found using some algorithm. I have to count how many times this pattern appears in a sequence such as 2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l …, where the max gap constraint is 2.
A gap constraint of 2 means that at most 2 other words are allowed between the words of the pattern 2982f 2982l 2981l. The main requirement is that all these matches must be non-overlapping.
E.g. for the pattern '2982f 2982l 2981l' in the sequence 2982f 2982f 2982l 2982l 2981l:
2982f 2982f 2982l 2982l 2981l is a match
2982f 2982l 2982l 2981l is another match
So the pattern appears twice; however, I should count it as one because the matches overlap.
So far, I am storing all the indexes where the words of the pattern appear.
import collections

pt = '2982f 2982l 2981l'
seq = '2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l 2752l 2982f 2771f 2771l 2982l 2981l 2981l 3211f 3342f 3341l 3411f 3441f 2982f 2731f 2742f 2982l 2822f 2981l 2811f 2982f 3001f 2992f 2992m 2982l 2981l'
pt_split = pt.split()
# Map each pattern word to the list of positions at which it occurs in seq
pt_dic = collections.OrderedDict()
for i in pt_split:
    pt_dic[i] = []
count_seq = 0
for i in seq.split():
    if i in pt_dic:
        pt_dic[i].append(count_seq)
    count_seq += 1  # advance the position for every word, matched or not
print(pt_dic)
Output:
OrderedDict([('2982f', [0, 1, 7, 11, 22, 29]), ('2982l', [2, 3, 8, 14, 25, 33]), ('2981l', [4, 9, 15, 16, 27, 34])])
Now my idea is to subtract the indexes in such a way that I can extract all the non-overlapping matches while respecting the gap constraint. However, I don't understand how to proceed from this point.
Can someone please help with this, or suggest an even better solution? It would be really helpful. Thanks.
Answer 1:
This can be solved elegantly with regex. We just have to convert the pattern into a regex and then count how often that regex matches in the input sequence. Because re.finditer only yields non-overlapping matches, counting its results gives the non-overlapping count directly.
For example, given the input pattern = 'A B C' and max_gap = 2, we want to create a regex like
A(arbitrary_word){,2}?B(arbitrary_word){,2}?C
Matching arbitrary words separated by spaces can be done with (?:\S+\s+), so we get:
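import re

def count_matches(pattern, seq, max_gap):
    # Escape each pattern word so that it is matched literally
    parts = map(re.escape, pattern.split())
    # Between two pattern words: whitespace, then lazily up to max_gap arbitrary words
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    regex = r'\b{}\b'.format(sep.join(parts))
    # finditer scans left to right and never returns overlapping matches
    return sum(1 for _ in re.finditer(regex, seq))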
Test runs:
count_matches('2982f 2982l 2981l', '2982f 2982f 2982l 2982l 2981l', 2)
# result: 1
count_matches('A B C', 'A B D E C B A B A B C', 2)
# result: 2
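To see what the function above actually searches for, it can help to print the generated regex and the text of each match. The snippet below is only a sketch: `build_regex` is a hypothetical helper name (not part of the answer) that repeats the same construction as `count_matches` but returns the regex string instead of a count.

```python
import re

def build_regex(pattern, max_gap):
    # Hypothetical helper: same construction as count_matches,
    # but it returns the regex string so it can be inspected.
    parts = [re.escape(word) for word in pattern.split()]
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    return r'\b{}\b'.format(sep.join(parts))

regex = build_regex('A B C', 2)
print(regex)
# \bA\s+(?:\S+\s+){,2}?B\s+(?:\S+\s+){,2}?C\b

# Each match spans from the first to the last pattern word, gaps included.
matches = [m.group(0) for m in re.finditer(regex, 'A B D E C B A B A B C')]
print(matches)
# ['A B D E C', 'A B A B C']
```

The lazy quantifier `{,2}?` makes each gap as short as possible, so every match ends at the earliest position reachable from its starting word.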
Answer 2:
I think you've done the hard part. Now just loop through the indices of the first word, looking for indices of the second word that fall within the gap, and so on. When you return to the indices of the first word after finding a match, skip any indices that fall inside that match.
For example, here's the solution with only two words in pt:
i = 0
while i < len(pt_dic[pt_split[0]]):
    ii = pt_dic[pt_split[0]][i]
    # print("ii=" + str(ii))
    j = 0
    while j < len(pt_dic[pt_split[1]]):
        jj = pt_dic[pt_split[1]][j]
        # print("jj=" + str(jj))
        # Gap of 2: at most 2 other words may sit between ii and jj
        if jj > ii and jj <= ii + 3:
            print("Match: (" + str(ii) + "," + str(jj) + ")")
            # Now that we've found a match, skip indices within that match.
            # Fall back to the list length when no later index exists, so the outer loop ends.
            i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > jj), len(pt_dic[pt_split[0]]))
            i -= 1  # counteract the increment at the end of the outer loop
            break
        j += 1
    i += 1
    # print("i=" + str(i))
And with three words in pt:
i = 0
while i < len(pt_dic[pt_split[0]]):
    match = False
    ii = pt_dic[pt_split[0]][i]
    # print("ii=" + str(ii))
    # Start loop at next index after ii (list length if there is none)
    j = next((x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii), len(pt_dic[pt_split[1]]))
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        # print("jj=" + str(jj))
        # Gap of 2: at most 2 other words between consecutive pattern words
        if jj > ii and jj <= ii + 3:
            # Start loop at next index after jj
            k = next((x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj), len(pt_dic[pt_split[2]]))
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                # print("kk=" + str(kk))
                if kk > jj and kk <= jj + 3:
                    print("Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + ")")
                    # Now that we've found a match, skip indices within that match.
                    i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > kk), len(pt_dic[pt_split[0]]))
                    i -= 1  # counteract the increment at the end of the outer loop
                    match = True
                k += 1
        j += 1
    i += 1
    # print("i=" + str(i))
And with four words in pt:
i = 0
while i < len(pt_dic[pt_split[0]]):
    match = False
    ii = pt_dic[pt_split[0]][i]
    # print("ii=" + str(ii))
    # Start loop at next index after ii (list length if there is none)
    j = next((x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii), len(pt_dic[pt_split[1]]))
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        # print("jj=" + str(jj))
        # Gap of 2: at most 2 other words between consecutive pattern words
        if jj > ii and jj <= ii + 3:
            # Start loop at next index after jj
            k = next((x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj), len(pt_dic[pt_split[2]]))
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                # print("kk=" + str(kk))
                if kk > jj and kk <= jj + 3:
                    # Start loop at next index after kk
                    l = next((x[0] for x in enumerate(pt_dic[pt_split[3]]) if x[1] > kk), len(pt_dic[pt_split[3]]))
                    while l < len(pt_dic[pt_split[3]]) and not match:
                        ll = pt_dic[pt_split[3]][l]
                        # print("ll=" + str(ll))
                        if ll > kk and ll <= kk + 3:
                            print("Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")")
                            # Now that we've found a match, skip indices within that match.
                            i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > ll), len(pt_dic[pt_split[0]]))
                            i -= 1  # counteract the increment at the end of the outer loop
                            match = True
                        l += 1
                k += 1
        j += 1
    i += 1
    # print("i=" + str(i))
I think the pattern has been established now, so generalisation to an arbitrary number of words is left as an exercise for the reader!
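One way the exercise might be approached is sketched below. It is not the answerer's code: `find_matches` and its inner `extend` are hypothetical names, and the function works directly on the list of words rather than on pt_dic. It assumes the gap constraint means "at most max_gap other words between consecutive pattern words", and it scans greedily from left to right, resuming after the end of each match so that matches cannot overlap.

```python
def find_matches(pattern, seq, max_gap):
    """Greedy left-to-right search for non-overlapping matches.

    Returns a list of tuples holding the token index of each pattern word.
    """
    words = pattern.split()
    tokens = seq.split()

    def extend(prev, depth):
        # Try to place words[depth:] after the token at index prev.
        if depth == len(words):
            return []
        # At most max_gap other tokens may sit between consecutive pattern words.
        last = min(prev + max_gap + 1, len(tokens) - 1)
        for pos in range(prev + 1, last + 1):
            if tokens[pos] == words[depth]:
                rest = extend(pos, depth + 1)
                if rest is not None:
                    return [pos] + rest
        return None  # nothing fits from here, so the caller tries its next candidate

    matches = []
    i = 0
    while words and i < len(tokens):
        rest = extend(i, 1) if tokens[i] == words[0] else None
        if rest is None:
            i += 1
        else:
            matches.append(tuple([i] + rest))
            i = matches[-1][-1] + 1  # resume after the match so matches cannot overlap
    return matches

print(find_matches('2982f 2982l 2981l', '2982f 2982f 2982l 2982l 2981l', 2))
# [(0, 2, 4)]
print(find_matches('A B C', 'A B D E C B A B A B C', 2))
# [(0, 1, 4), (6, 7, 10)]
```

Because it compares words position by position, this sketch also copes with patterns that repeat a word, which a dictionary keyed by pattern word cannot represent. It makes no attempt to prove that the greedy choice maximises the number of non-overlapping matches.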
Source: https://stackoverflow.com/questions/44490525/non-overlapping-pattern-matching-with-gap-constraint-in-python