Clustering Strings Based on Similar Word Sequences
Question

I am looking for an efficient way to group about 10 million strings into clusters based on the similar word sequences that appear in them. Consider a list of strings like:

the fruit hut number one
the ice cre am shop number one
jim's taco
ice cream shop in the corner
the ice cream shop
the fruit hut
jim's taco outlet number one
jim's t aco in the corner
the fruit hut in the corner

After the algorithm runs on them, I want them clustered as follows:

the ice cre am shop number one
ice cream shop in
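
To make "similar word sequences" concrete, here is a minimal sketch of one possible similarity measure: Jaccard similarity over word bigrams (shingles). This is an assumption for illustration only; the question does not prescribe a measure, and the helper names `word_shingles` and `jaccard` are hypothetical.

```python
def word_shingles(text, n=2):
    """Return the set of n-word sequences (shingles) found in text."""
    words = text.lower().split()
    if len(words) < n:
        # Too short for a full shingle: treat the whole string as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity between the shingle sets of two strings."""
    sa, sb = word_shingles(a, n), word_shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Two strings from the same intended cluster vs. two from different ones.
same_shop = jaccard("the fruit hut number one", "the fruit hut in the corner")
other_shop = jaccard("the fruit hut number one", "the ice cre am shop number one")
print(same_shop, other_shop)
```

On the sample strings, the two "fruit hut" entries share two bigrams out of seven, while the "fruit hut" and "ice cre am shop" entries share only the trailing "number one". Comparing every pair this way is quadratic in the number of strings, so the sketch only illustrates the measure and would not scale to 10 million strings as written.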