Python regex module fuzzy match: substitution count not as expected

Submitted by 喜你入骨 on 2020-01-04 18:42:32

Question


Background

The Python module regex allows fuzzy matching.

You can specify the allowable number of substitutions (s), insertions (i), deletions (d), and total errors (e).

The fuzzy_counts property of a match result returns a three-element tuple such as (0,0,0), where:

match.fuzzy_counts[0] = count for 's' 
match.fuzzy_counts[1] = count for 'i' 
match.fuzzy_counts[2] = count for 'd'
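
To make the tuple positions easier to read, the counts can be wrapped in a named tuple. This is only a minimal sketch: the names FuzzyCounts and label_fuzzy_counts are hypothetical helpers, not part of the regex module, and the field order simply follows the description above.

```python
from collections import namedtuple

# Hypothetical wrapper (not part of the regex module); the field order
# follows the (substitutions, insertions, deletions) layout described above.
FuzzyCounts = namedtuple("FuzzyCounts", ["substitutions", "insertions", "deletions"])


def label_fuzzy_counts(counts):
    """Turn a plain 3-tuple of fuzzy counts into a tuple with named fields."""
    return FuzzyCounts(*counts)


# With the regex module installed, the argument would be match.fuzzy_counts;
# a literal tuple is used here so the sketch runs on its own.
counts = label_fuzzy_counts((6, 0, 1))
print(counts)  # FuzzyCounts(substitutions=6, insertions=0, deletions=1)
```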

Problem

The deletions and insertions are counted as expected, but not the substitutions.

In the example below, the only change is a single character deleted in the query, yet the substitutions count is 6 (7 if the BESTMATCH option is removed).

How are the substitutions counted?

I would be grateful if someone could explain how this works.

>>> import regex
>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG" 
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(6,0,1)
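
As an independent check of what the counts should be, the sketch below aligns the pattern against the whole query with a plain dynamic-programming edit distance, without using the regex module at all. The function names parse_pattern and min_edit_counts are hypothetical, the parser only handles literals and simple [..] classes, and the alignment is global (the entire query against the entire pattern), so this illustrates the expected result rather than how regex computes its costs.

```python
def parse_pattern(pattern):
    """Split a pattern of literals and [..] classes into a list of allowed-character sets."""
    positions = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            positions.append(set(pattern[i + 1:end]))
            i = end + 1
        else:
            positions.append({pattern[i]})
            i += 1
    return positions


def min_edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) for a cheapest global alignment."""
    positions = parse_pattern(pattern)
    rows, cols = len(positions) + 1, len(text) + 1
    # Each cell holds (total, subs, ins, dels); tuple comparison picks the lowest total first.
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for j in range(1, cols):
        table[0][j] = (j, 0, j, 0)  # text characters with no pattern position: insertions
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i)  # pattern positions with no text character: deletions
        for j in range(1, cols):
            t, s, n, d = table[i - 1][j - 1]
            if text[j - 1] in positions[i - 1]:
                diagonal = (t, s, n, d)          # exact match, no cost
            else:
                diagonal = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = table[i][j - 1]
            insertion = (t + 1, s, n + 1, d)     # extra character in the text
            t, s, n, d = table[i - 1][j]
            deletion = (t + 1, s, n, d + 1)      # pattern position missing from the text
            table[i][j] = min(diagonal, insertion, deletion)
    return table[-1][-1][1:]


pattern = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(min_edit_counts(pattern, query))  # (0, 0, 1)
```

The pattern has 27 positions and the query has 26 characters, so a single deletion and no substitutions is the cheapest way to reconcile them.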

Answer 1:


This was caused by what looks like a bug in the regex module's cost calculations. It was still present in regex version 2015.10.05 but was fixed in the next version, 2015.10.22, as shown below:

$ sudo pip3 install regex==2015.10.05
Processing /root/.cache/pip/wheels/24/cb/ae/9653e30c8f801544a645e17d26fa6803aeaf76ad0482663c27/regex-2015.10.5-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
Successfully installed regex-2015.10.5
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(5, 0, 1)
$ sudo pip3 install regex==2015.10.22
Processing /root/.cache/pip/wheels/60/f6/9a/23e723633e62a79064cb301c54a3b50482b8c690f86c9983ee/regex-2015.10.22-cp38-cp38-linux_x86_64.whl
Installing collected packages: regex
  Found existing installation: regex 2015.10.5
    Uninstalling regex-2015.10.5:
      Successfully uninstalled regex-2015.10.5
Successfully installed regex-2015.10.22
$ python3 -c 'import regex; reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"; query = "TATGGACCAAAGTCTCAAGCCATGTG"; match = regex.search(reference, query, regex.BESTMATCH);print(match.fuzzy_counts)'
(0, 0, 1)

Given these dates, I infer that the commit that fixed the bug was https://bitbucket.org/mrabarnett/mrab-regex/commits/296c1daf86619039c6fe55868e7d861097d01aae, with the description:

Hg issue 161: Unexpected fuzzy match results

Fixed the bug and did some related tidying up.

The referenced bug is https://bitbucket.org/mrabarnett/mrab-regex/issues/161.
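
If you need to know whether an environment is affected, one option is to compare the installed version string against 2015.10.22. The helper below is a rough sketch: predates_fix is a hypothetical name, and it assumes the version string consists of dot-separated integers, as in the versions quoted in this answer.

```python
# First release in which the fuzzy-count bug was observed to be fixed (per the answer above).
FIXED_IN = (2015, 10, 22)


def predates_fix(version_string):
    """Return True if a dotted numeric version is older than 2015.10.22."""
    # int() drops leading zeros, so "2015.10.05" and "2015.10.5" compare equal.
    parts = tuple(int(part) for part in version_string.split(".")[:3])
    return parts < FIXED_IN


print(predates_fix("2015.10.05"))  # True
print(predates_fix("2015.10.22"))  # False
```

On Python 3.8 or later, the installed version string can be obtained with importlib.metadata.version("regex"), which raises PackageNotFoundError if the package is not installed.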




Answer 2:


The issue seems to be related to the limits set in the allowed-error specification.

Tightening the substitution limit to s<3 (and the total error limit to e<4) lowers the reported substitution count:

>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}" 
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG"  
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts) 
(1,0,1)

Reducing the allowed error for 's' even further, to s<2, returns the expected 's' count for this match:

>>> reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
>>> query = "TATGGACCAAAGTCTCAAGCCATGTG" 
>>> match = regex.search(reference, query, regex.BESTMATCH)
>>> print(match.fuzzy_counts)
(0,0,1)

Why it behaves in this way is still a mystery to me.



Source: https://stackoverflow.com/questions/31193749/python-regex-module-fuzzy-match-substitution-count-not-as-expected
