Non-overlapping pattern matching with gap constraint in Python

廉价感情. Submitted on 2019-12-11 10:38:44

Question


I want to find the total number of non-overlapping matches of a pattern in a sequence, with a maximum gap constraint of 2.

E.g. 2982f 2982l 2981l is a pattern found using some algorithm. I have to count how many times this pattern appears in a sequence such as 2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l … , where the max gap constraint is 2.

A gap constraint of 2 means that at most 2 other words may appear between the words of the pattern 2982f 2982l 2981l. The main point is that all of these matches must be non-overlapping.

E.g. for the pattern '2982f 2982l 2981l' in the sequence 2982f 2982f 2982l 2982l 2981l:

  • 2982f 2982f 2982l 2982l 2981l is a match
  • 2982f 2982l 2982l 2981l is another match

So this pattern appears twice; however, I should count it only once, because the two matches overlap.

So far, I am storing all the indices at which the words of the pattern appear.

import collections

pt = '2982f  2982l  2981l'

seq = '2982f  2982f  2982l  2982l  2981l  3111m 3171f  2982f  2982l  2981l  2752l 2982f  2771f  2771l  2982l  2981l  2981l 3211f 3342f 3341l 3411f 3441f 2982f  2731f  2742f  2982l  2822f  2981l 2811f 2982f  3001f 2992f 2992m  2982l  2981l'

pt_split = pt.split()
pt_dic = collections.OrderedDict()
for i in pt_split:
    pt_dic[i] = []

# Record, for each pattern word, the positions at which it occurs in the sequence
for count_seq, i in enumerate(seq.split()):
    if i in pt_dic:
        pt_dic[i].append(count_seq)

print(pt_dic)

Output:

OrderedDict([('2982f', [0, 1, 7, 11, 22, 29]), ('2982l', [2, 3, 8, 14, 25, 33]), ('2981l', [4, 9, 15, 16, 27, 34])])

My idea is to subtract the indices from one another so that I can extract all the non-overlapping matches while respecting the gap constraint. However, I don't understand how to proceed from this point.

Can someone please help with this, or suggest a better solution? It would be really helpful. Thanks.


Answer 1:


This can be solved elegantly with a regex. We just have to convert the pattern into a regex and then count how often that regex matches in the input sequence. Because re.finditer yields only non-overlapping matches, overlapping occurrences are never counted twice.

For example, given the input pattern = 'A B C' and max_gap = 2, we want to create a regex like

A(arbitrary_word){,2}?B(arbitrary_word){,2}?C

Matching arbitrary words separated by spaces can be done with (?:\S+\s+), so we get:

import re

def count_matches(pattern, seq, max_gap):
    parts = map(re.escape, pattern.split())
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    regex = r'\b{}\b'.format(sep.join(parts))
    return sum(1 for _ in re.finditer(regex, seq))
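
To see what the function actually builds, here is a minimal sketch that prints the generated regex and the text of each match. The helper name build_regex is made up for this illustration; it only repeats the construction lines from count_matches.

```python
import re

def build_regex(pattern, max_gap):
    # Hypothetical helper: same construction as in count_matches.
    parts = [re.escape(word) for word in pattern.split()]
    # After each pattern word: whitespace, then lazily up to max_gap extra words.
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    return r'\b{}\b'.format(sep.join(parts))

regex = build_regex('A B C', 2)
print(regex)

# finditer reports matches left to right without overlap
matched = [m.group() for m in re.finditer(regex, 'A B D E C B A B A B C')]
print(matched)  # ['A B D E C', 'A B A B C']
```

The lazy quantifier makes each gap as short as possible, so a match ends early and leaves more of the sequence available for later matches.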

Test runs:

count_matches('2982f  2982l  2981l', '2982f  2982f  2982l  2982l  2981l', 2)
# result: 1

count_matches('A B C', 'A B D E C B A B A B C', 2)
# result: 2
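
As a further check, the sketch below runs the same approach on the full sequence from the question and contrasts it with a count that allows overlaps. The lookahead variant and the names make_regex, count_nonoverlapping and count_match_starts are assumptions added for illustration. The lookahead counts every position at which some match can start, which is one way to see the overlap the question describes.

```python
import re

def make_regex(pattern, max_gap):
    parts = [re.escape(word) for word in pattern.split()]
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    return r'\b{}\b'.format(sep.join(parts))

def count_nonoverlapping(pattern, seq, max_gap):
    # finditer resumes scanning after the end of each match
    return sum(1 for _ in re.finditer(make_regex(pattern, max_gap), seq))

def count_match_starts(pattern, seq, max_gap):
    # A lookahead consumes no text, so every possible start position is reported
    lookahead = '(?={})'.format(make_regex(pattern, max_gap))
    return sum(1 for _ in re.finditer(lookahead, seq))

pt = '2982f 2982l 2981l'
short_seq = '2982f 2982f 2982l 2982l 2981l'
seq = (
    '2982f 2982f 2982l 2982l 2981l 3111m 3171f 2982f 2982l 2981l 2752l '
    '2982f 2771f 2771l 2982l 2981l 2981l 3211f 3342f 3341l 3411f 3441f '
    '2982f 2731f 2742f 2982l 2822f 2981l 2811f 2982f 3001f 2992f 2992m '
    '2982l 2981l'
)

print(count_nonoverlapping(pt, short_seq, 2))  # 1
print(count_match_starts(pt, short_seq, 2))    # 2
print(count_nonoverlapping(pt, seq, 2))        # 4
```

The last occurrence of 2982f in the long sequence is followed by three other words before 2982l, so it exceeds the gap limit and is not counted.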



Answer 2:


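I think you've done the hard part. Now just loop through the indices of the first word, looking for indices of the second word that fall within the gap, and so on for the remaining words. Then go back to the indices of the first word and, if you found a match last time, skip any indices that fall inside that match. Note that the test jj <= ii + 2 used below lets at most one other word sit between two pattern words; use ii + 3 (and likewise for the later words) if two intervening words should be allowed.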

For example, here's the solution with only two words in pt:

i=0
while i < len(pt_dic[pt_split[0]]):
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)
    j=0
    while j < len(pt_dic[pt_split[1]]):
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:
            print("Match: (" + str(ii) + "," + str(jj) + ")")
            # Now that we've found a match, skip indices within that match.
            # The default stops next() from raising StopIteration when no later index exists.
            i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > jj), len(pt_dic[pt_split[0]]))
            i -= 1 # counteract the increment at the end of the outer loop
            break
        j += 1
    i += 1
    #print "i=" + str(i)

And with three words in pt:

i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)

    # Start loop at next index after ii
    # (each next() gets a default so it cannot raise StopIteration when no later index exists)
    j = next((x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii), len(pt_dic[pt_split[1]]))
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:

            # Start loop at next index after jj
            k = next((x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj), len(pt_dic[pt_split[2]]))
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:
                    print("Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + ")")
                    # Now that we've found a match, skip indices within that match.
                    i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > kk), len(pt_dic[pt_split[0]]))
                    i -= 1 # counteract the increment at the end of the outer loop
                    match=True
                k += 1
        j += 1
    i += 1

    #print "i=" + str(i)

And with four words in pt:

i=0
while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)

    # Start loop at next index after ii
    # (each next() gets a default so it cannot raise StopIteration when no later index exists)
    j = next((x[0] for x in enumerate(pt_dic[pt_split[1]]) if x[1] > ii), len(pt_dic[pt_split[1]]))
    while j < len(pt_dic[pt_split[1]]) and not match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)
        if jj > ii and jj <= ii + 2:

            # Start loop at next index after jj
            k = next((x[0] for x in enumerate(pt_dic[pt_split[2]]) if x[1] > jj), len(pt_dic[pt_split[2]]))
            while k < len(pt_dic[pt_split[2]]) and not match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)
                if kk > jj and kk <= jj + 2:

                    # Start loop at next index after kk
                    l = next((x[0] for x in enumerate(pt_dic[pt_split[3]]) if x[1] > kk), len(pt_dic[pt_split[3]]))
                    while l < len(pt_dic[pt_split[3]]) and not match:
                        ll = pt_dic[pt_split[3]][l]
                        #print "ll=" + str(ll)
                        if ll > kk and ll <= kk + 2:
                            print("Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")")
                            # Now that we've found a match, skip indices within that match.
                            i = next((x[0] for x in enumerate(pt_dic[pt_split[0]]) if x[1] > ll), len(pt_dic[pt_split[0]]))
                            i -= 1 # counteract the increment at the end of the outer loop
                            match=True
                        l += 1
                k += 1
        j += 1
    i += 1

    #print "i=" + str(i)

I think the pattern has been established now, so generalisation to an arbitrary number of words is left as an exercise for the reader!
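
One possible generalisation, sketched here, replaces the hand-nested loops with a small recursive helper. It is not the answerer's code: the names find_matches and _extend are invented for this sketch, it scans the word list directly instead of the index dictionary, and it assumes max_gap counts the other words allowed between each adjacent pair of pattern words (so the next pattern word may sit up to max_gap + 1 positions later).

```python
def _extend(words, parts, positions, max_gap):
    """Complete a partial match; positions holds the word indices chosen so far."""
    if len(positions) == len(parts):
        return positions
    prev = positions[-1]
    target = parts[len(positions)]
    # At most max_gap other words may sit between prev and the next pattern word.
    last = min(prev + max_gap + 1, len(words) - 1)
    for nxt in range(prev + 1, last + 1):
        if words[nxt] == target:
            found = _extend(words, parts, positions + [nxt], max_gap)
            if found is not None:
                return found
    return None  # no completion from this prefix


def find_matches(pattern, seq, max_gap):
    """Return non-overlapping matches as tuples of word indices, scanning left to right."""
    words, parts = seq.split(), pattern.split()
    matches = []
    i = 0
    while parts and i < len(words):
        found = _extend(words, parts, [i], max_gap) if words[i] == parts[0] else None
        if found is None:
            i += 1
        else:
            matches.append(tuple(found))
            i = found[-1] + 1  # resume after the match so nothing overlaps
    return matches


print(find_matches('A B C', 'A B D E C B A B A B C', 2))
# [(0, 1, 4), (6, 7, 10)]
```

The number of matches is simply len(find_matches(...)). Like the nested loops, this takes the first match it finds from each starting position, so it is a greedy left-to-right scan rather than a search for the largest possible set of matches.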



Source: https://stackoverflow.com/questions/44490525/non-overlapping-pattern-matching-with-gap-constraint-in-python
