Can I use K-means algorithm on a string?


K-means doesn't really care about the type of the data involved. All you need to run K-means is some way to measure a "distance" from one item to another. It'll do its thing based on those distances, regardless of how they happen to be computed from the underlying data.

That said, I haven't used scipy.cluster.vq, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.

unutbu

One problem you would face if using scipy.cluster.vq.kmeans is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solvable by k-means clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.

That seems... difficult. Perhaps you are looking for Levenshtein distance instead?
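If you did want to stay with scipy.cluster.vq, a minimal sketch of that shoe-horning might look like the following (my own illustration, not from the question): embed each word as a vector of character counts and cluster those vectors with Euclidean k-means. The word list and the character-count embedding are assumptions chosen just to make the example runnable; whether such an embedding is a reasonable notion of closeness for your strings is exactly the hard part.

import numpy as np
from scipy.cluster.vq import kmeans, vq

words = ['apple', 'applaud', 'append', 'barker', 'baker', 'park']

def char_counts(word, alphabet='abcdefghijklmnopqrstuvwxyz'):
    # Embed a word as a 26-dimensional vector of lowercase letter counts
    # (a deliberately crude embedding, just for illustration).
    counts = np.zeros(len(alphabet))
    for ch in word.lower():
        if ch in alphabet:
            counts[alphabet.index(ch)] += 1
    return counts

obs = np.array([char_counts(w) for w in words])

# kmeans returns the centroids; vq assigns each observation to its
# nearest centroid by Euclidean distance.  (scipy's docs recommend
# whitening the observations first; skipped here for brevity.)
centroids, distortion = kmeans(obs, 2)
labels, _ = vq(obs, centroids)
print(list(zip(words, labels)))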

Note there are variants of the K-means algorithm that can work with non-Euclidean distance metrics (such as Levenshtein distance). K-medoids (aka PAM), for instance, can be applied to data with an arbitrary distance metric.

For example, using Pycluster's implementation of k-medoids, and nltk's implementation of Levenshtein distance,

import nltk.metrics.distance as distance
import Pycluster as PC

words = ['apple', 'Doppler', 'applaud', 'append', 'barker', 
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

# Lower-triangular part of the pairwise Levenshtein distance matrix,
# flattened row by row -- the layout Pycluster's kmedoids accepts.
dist = [distance.edit_distance(words[i], words[j])
        for i in range(1, len(words))
        for j in range(0, i)]

# labels[i] is the index of the medoid word assigned to words[i].
labels, error, nfound = PC.kmedoids(dist, nclusters=3)

# Group the words by their medoid to see the clusters.
cluster = dict()
for word, label in zip(words, labels):
    cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
    print(grp)

yields a result like

['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']

K-means only works with Euclidean distance. Edit distances such as Levenshtein do obey the triangle inequality, but they are not Euclidean. For the sorts of metrics you're interested in, you're better off using a different sort of algorithm, such as hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
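For what it's worth, here is a rough sketch of that hierarchical-clustering route, using scipy.cluster.hierarchy on a precomputed Levenshtein distance matrix (the word list and the choice of 3 clusters are just illustrative assumptions, reusing the words from the answer above):

import numpy as np
from nltk.metrics.distance import edit_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

words = ['apple', 'Doppler', 'applaud', 'append', 'barker',
         'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']

# Full pairwise Levenshtein distance matrix, then the condensed 1-D form
# that linkage() expects.
dmat = np.array([[edit_distance(a, b) for b in words] for a in words],
                dtype=float)
condensed = squareform(dmat)

# 'average' linkage works with any precomputed dissimilarity; avoid
# 'ward', which assumes Euclidean distances.
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
for grp in clusters.values():
    print(grp)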

Alternatively, just convert your list of RNA into a weighted graph, with Levenshtein weights on the edges, and then reduce it to a minimum spanning tree. The most connected nodes of that tree will be, in a sense, the "most representative".
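A quick sketch of that graph idea, using networkx (a library not mentioned above) and nltk's edit distance; the short sequences below are placeholders, not real RNA:

import networkx as nx
from nltk.metrics.distance import edit_distance

seqs = ['AUGGC', 'AUGGU', 'AUGCC', 'CCGUA', 'CCGUU', 'GGAAU']

# Complete graph over the strings, with Levenshtein distance as edge weight.
G = nx.Graph()
for i, a in enumerate(seqs):
    for b in seqs[i + 1:]:
        G.add_edge(a, b, weight=edit_distance(a, b))

# The minimum spanning tree keeps the cheapest edges that connect everything.
mst = nx.minimum_spanning_tree(G)

# Nodes with the highest degree in the tree are "hubs" joined to many
# low-distance neighbours -- the "most representative" strings in that sense.
print(sorted(mst.degree, key=lambda node_deg: node_deg[1], reverse=True))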
