Clustering Strings Based on Similar Word Sequences

深忆病人 2021-01-24 04:24

I am looking for an efficient way to group about 10 million strings into clusters based on the word sequences they share.

Consider a list of strings like:

2 Answers
  •  天命终不由人
    2021-01-24 05:02

    Clustering is the wrong tool for you.

    To any unsupervised algorithm, the following partitioning is just as good:

    the fruit hut number one
    the ice cre am shop number one
    jim's taco outlet number one
    
    the ice cream shop
    the fruit hut
    jim's taco
    
    ice cream shop in the corner
    jim's t aco in the corner
    the fruit hut in the corner
    

    That is because, to a clustering algorithm, "number one" and "in the corner" are shared phrases just as much as the shop names are. The second cluster is simply the leftovers.

    Use something supervised instead.
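
The answer's point can be checked with a small sketch. It is not part of the original answer: the helper names `word_bigrams` and `common_bigrams` are made up for illustration, and "shared phrase" is assumed here to mean a word bigram that occurs in every member of a cluster. Under that assumption, the partition by trailing phrase shares at least as much as the partition by shop name, so nothing in the data alone tells an unsupervised method which grouping is wanted.

```python
def word_bigrams(text):
    """Return the set of adjacent word pairs in a string."""
    words = text.split()
    return set(zip(words, words[1:]))


def common_bigrams(cluster):
    """Return the word bigrams that occur in every string of a cluster."""
    return set.intersection(*(word_bigrams(s) for s in cluster))


# The partition from the answer: grouped by trailing phrase.
by_suffix = [
    ["the fruit hut number one",
     "the ice cre am shop number one",
     "jim's taco outlet number one"],
    ["the ice cream shop", "the fruit hut", "jim's taco"],
    ["ice cream shop in the corner",
     "jim's t aco in the corner",
     "the fruit hut in the corner"],
]

# The partition a human probably wants: grouped by shop name.
by_name = [
    ["the fruit hut number one", "the fruit hut",
     "the fruit hut in the corner"],
    ["the ice cre am shop number one", "the ice cream shop",
     "ice cream shop in the corner"],
    ["jim's taco outlet number one", "jim's taco",
     "jim's t aco in the corner"],
]

by_suffix_common = [common_bigrams(c) for c in by_suffix]
by_name_common = [common_bigrams(c) for c in by_name]

for label, result in (("by suffix", by_suffix_common),
                      ("by name", by_name_common)):
    print(label)
    for shared in result:
        print("  ", sorted(shared))
```

With these nine strings, two of the suffix-based clusters have a bigram common to all members, while the noisy spellings leave only the "fruit hut" name cluster with one. A supervised approach would instead be told, through labeled examples, that the shop name is the part that matters.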
