Algorithm to to Cluster Similar Strings in Python?

跟風遠走 提交于 2019-12-11 09:48:29

问题


I'm working on a script that currently contains multiple lists of DNA sequences (each list has a varying number of DNA sequences) and I need to cluster the sequences in each list based on Hamming Distance similarity. My current implementation of this (very crude at the moment) extracts the first sequence in the list and calculates the Hamming Distance of each subsequent sequence. If it's within a certain Hamming Distance, it appends it to a new list which later is used to remove sequences from the original list as well as store the similar sequences in a defaultdict. See my current implementation of my code below:

def hamming_dist(sequence1, sequence2):
"""
Calculates the hamming distance between 2 sequences
"""
    assert len(sequence1) == len(sequence2)
    return sum(sequence1 !=sequence2 for sequence1,sequence2 \
    in itertools.izip(sequence1,sequence2))


def group_sequences(sequences_list):
    trash_sequences = []
    main_sequence = sequences_list[0]
    clustered_sequence = defaultdict(list)
    while len(sequences_list) > 1:
        for sequence in sequences_list:
            ham_dist = hamming_dist(main_sequence,sequence)
            if hamming_dist < 30:
                trash_sequences.append(sequence)

        for similar_sequences in trash_sequences:
            sequences_list.remove(similar_sequences)
            clustered_sequence[main_tcr].append(similar_sequences)
    else:
        clustered_sequence[main_sequence].append(None)
    return clustered_sequence

Obviously this isn't the best or most efficient way to do it and even with this implementation, there are still some bugs in my script that I encountered. I read through over StackOverflow/StackExchange questions to see if other people have encountered my problem and of the similar questions I found, many other people have mentioned about using algorithms such as the K-Means algorithm, Markov Clustering method, hierarchy clustering, etc. I'm not too familiar with any of these methods except the K-means method which requires numbers, not strings.

Which clustering method(s) would you suggest I implement to cluster similar DNA sequences together as well as the best way to implement your preferred method of choice? Any suggestions would be much appreciated!


回答1:


The best choice to get started is hierarchical clustering.

It's easy to understand, it allows any distance, and it can visualize the clustering as a tree even when the data itself is hard to visualize.



来源:https://stackoverflow.com/questions/49101659/algorithm-to-to-cluster-similar-strings-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!