fuzzy string matching with grep

Submitted by 本秂侑毒 on 2019-12-19 18:20:24

Question


I am trying to match rows in a file that contain a string, say ACTGGGTAAACTA. If I run

grep "ACTGGGTAAACTA" file 

I only get rows with exact matches. Is there a way to allow a certain number of mismatches (substitutions, insertions, or deletions)? For example, I am looking for sequences with:

  1. Up to 3 substitutions, like "AGTGGGTAACCAA".

  2. Insertions/deletions (having a partial match like "ACTGGGAAAATAAACTA" or "ACTAAACTA")


Answer 1:


There used to be a tool called agrep for fuzzy regex matching, but it got abandoned.

http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.

https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.

Failing that, see if you can find tre-agrep for your distro.




Answer 2:


You can use tre-agrep and specify the maximum edit distance with the -E switch, or with the shorthand -N where N is a digit from 0 to 9. For example, if you have a file foo:

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

You can match every line with an edit distance of up to 9 like this (-s prints the cost of each match, -w matches whole words only):

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
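
The costs in this output can be checked independently. Each line in foo is a single word and -w asks for whole-word matches, so the reported cost should equal the plain Levenshtein distance between the pattern and the whole line. Here is a minimal Python sketch of that check; the function name levenshtein is my own, and this is the textbook dynamic-programming algorithm with unit costs, not tre-agrep's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit-cost substitutions, insertions, deletions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


pattern = "ACTGGGTAAACTA"
lines = ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]
costs = [levenshtein(pattern, line) for line in lines]
for cost, line in zip(costs, lines):
    print(f"{cost}:{line}")
```

The first line needs four inserted characters and the second needs four deleted ones, which is why both report a cost of 4.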



Answer 3:


Short answer: no.

Long answer: As @JDB said, regex is inherently precise. You can manually allow a mismatch at a given position, for example [ATGC] instead of A, but there is no way to allow only a small number of mismatches at arbitrary positions. I suggest that you write your own code to parse it, or try to find a DNA parser somewhere.
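
For the substitutions-only case, writing your own matcher is short. Here is a minimal sketch in Python, assuming the pattern may occur anywhere in a line and that only substitutions count as mismatches. The names hamming_hits and fuzzy_grep are hypothetical, and insertions or deletions are deliberately not handled:

```python
from typing import Iterable, Iterator, List, Tuple


def hamming_hits(pattern: str, text: str, max_subs: int) -> List[Tuple[int, int]]:
    """Return (offset, mismatches) for each window of text within max_subs substitutions."""
    m = len(pattern)
    hits = []
    # Slide a window of the pattern's length over the text.
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        mismatches = sum(p != w for p, w in zip(pattern, window))
        if mismatches <= max_subs:
            hits.append((start, mismatches))
    return hits


def fuzzy_grep(pattern: str, lines: Iterable[str], max_subs: int) -> Iterator[str]:
    """Yield the lines that contain pattern with at most max_subs substitutions."""
    for line in lines:
        line = line.rstrip("\n")
        if hamming_hits(pattern, line, max_subs):
            yield line


sample = ["AGTGGGTAACCAA", "ACTGGGTAAACTA", "ACTAAACTA", "TTTTTTTTTTTTT"]
matched = list(fuzzy_grep("ACTGGGTAAACTA", sample, max_subs=3))
print(matched)
```

Lines shorter than the pattern, such as ACTAAACTA, can never match here because a substitution does not change length; covering them requires an edit-distance approach instead.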




Answer 4:


There's a Python library called fuzzysearch (that I wrote) which provides precisely the required functionality.

Here's some sample code that should work:

from fuzzysearch import find_near_matches

with open('path/to/file', 'r') as f:
    data = f.read()

# 1. search allowing up to 3 substitutions
#    (insertions and deletions are explicitly disallowed)
matches = find_near_matches("ACTGGGTAAACTA", data,
                            max_substitutions=3, max_insertions=0, max_deletions=0)

# 2. also allow insertions and deletions, i.e. allow an edit distance
#    a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)


Source: https://stackoverflow.com/questions/30355972/fuzzy-string-matching-with-grep
