I am looking for an efficient way to group about 10 million strings into clusters based on the appearance of similar word sequences.
Consider a list of strings like:
Clustering is the wrong tool for you.
For any unsupervised algorithm, the following partitioning is just as good as the one you have in mind:
the fruit hut number one
the ice cream shop number one
jim's taco outlet number one

the ice cream shop
the fruit hut
jim's taco

ice cream shop in the corner
jim's taco in the corner
the fruit hut in the corner
Because to a clustering algorithm, "number one" and "in the corner" are shared phrases too, just like the shop names. The second cluster is simply the leftovers.
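To make this concrete, here is a minimal sketch that counts in how many of the nine strings each word bigram occurs. The choice of word bigrams as the "shared phrase" feature and the names `word_bigrams` and `doc_freq` are my own assumptions for illustration, not something from your setup. The point is only that a phrase like "number one" gets the same count as "fruit hut", so nothing in the data itself says which one should define the clusters.

```python
from collections import Counter

# The nine example strings from the partitioning above.
strings = [
    "the fruit hut number one",
    "the ice cream shop number one",
    "jim's taco outlet number one",
    "the ice cream shop",
    "the fruit hut",
    "jim's taco",
    "ice cream shop in the corner",
    "jim's taco in the corner",
    "the fruit hut in the corner",
]


def word_bigrams(text):
    """Return the set of adjacent word pairs in a string (illustrative helper)."""
    words = text.split()
    return set(zip(words, words[1:]))


# Document frequency: the number of strings each bigram appears in.
doc_freq = Counter()
for s in strings:
    doc_freq.update(word_bigrams(s))

# Print every bigram next to its count, most frequent first.
for bigram, count in sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" ".join(bigram), count)
```

With these counts there is no signal that tells an unsupervised method to prefer the shop names over the suffixes. That preference has to come from you, for example as labels.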
Use something supervised instead.