how to generate a set of similar strings in python

ぃ、小莉子 提交于 2020-01-04 13:35:55

问题


I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I like to pass in, a source string (i.e. a string which is used to generate other strings that are similar to it), the number of strings need to be generated and a threshold as parameters, i.e. similarities among the strings in the generated set should be greater than the threshold. I am wondering what Python package(s) should I use to achieve that? Or any idea how to implement this?


回答1:


I think you can think of the problem in another way (reversed).

  • Given a string, say it is sittin.
  • Given a threshold (edit distance), say it is k.
  • Then you apply combinations of different "edits" in k-steps.

For example, let's say k = 2. And assume the allowed edit modes you have are:

  • delete one character
  • add one character
  • substitute one character with another one.

Then the logic is something like below:

input = 'sittin'
for num in 1 ... n:  # suppose you want to have n strings generated
  my_input_ = input
  # suppose the edit distance should be smaller or equal to k;
  # but greater or equal to one
  for i in in 1 ... randint(k): 
    pick a random edit mode from (delete, add, substitute)
    do it! and update my_input_

If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, the edit must be valid.




回答2:


Borrowing heavily on the pseudocode in @greeness answer I thought I would include the code I used to do this for DNA sequences.
This may not be your exact use case but I think it should be easily adaptable.

import random
dna = set(["A", "C", "G", "T"])

class Sequence(str):

    def mutate(self, d, n):
        mutants = set([self])
        while len(mutants) < n:
            k = random.randint(1, d)
            for _ in range(k):
                mutant_type = random.choice(["d", "s", "i"])
                if mutant_type == "i":
                    mutants.add(self.insertion(k))
                elif mutant_type == "d":
                    mutants.add(self.deletion(k))
                elif mutant_type == "s":
                    mutants.add(self.substitute(k))
        return list(mutants)


    def deletion(self, n):
        if n >= len(self):
            return ""
        chars = list(self)
        i = 0
        while i < n:
            idx = random.choice(range(len(chars)))
            del chars[idx]
            i += 1
        return "".join(chars)

    def insertion(self, n):
        chars = list(self)
        i = 0
        while i < n:
            idx = random.choice(range(len(chars)))
            new_base = random.choice(list(dna))
            chars.insert(idx, new_base)
            i += 1
        return "".join(chars)

    def substitute(self, n):
        idxs = random.sample(range(len(self)), n)
        chars = list(self)
        for i in idxs:
            new_base = random.choice(list(dna.difference(chars[i])))
            chars[i] = new_base
        return "".join(chars)

To use this you can do the following

s = Sequence("AAAAA")
d = 2  # max edit distance
n = 5  # number of strings in result
s.mutate(d, n)
>>> ['AAA', 'GACAAAA', 'AAAAA', 'CAGAA', 'AACAAAA']


来源:https://stackoverflow.com/questions/40358855/how-to-generate-a-set-of-similar-strings-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!