Better text documents clustering than tf/idf and cosine similarity?

走了就别回头了 2021-02-01 06:56

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorit

3 answers
  •  甜味超标
    2021-02-01 07:16

    As mentioned in other comments and answers, using LDA can give good tweet->topic weights.

    If these weights alone do not group the tweets well enough for your needs, you could feed the per-tweet topic distributions into a clustering algorithm.
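A minimal sketch of that second step, assuming you already have one topic distribution per tweet from some LDA implementation. The weights below are hypothetical values made up for illustration, and `hellinger`, `cluster_stream` and the `threshold` of 0.3 are names and settings I picked for the example, not part of any library. It processes tweets one at a time, which fits a stream: each tweet joins the nearest existing cluster, or opens a new one if nothing is close enough.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def cluster_stream(topic_weights, threshold=0.3):
    """Assign each distribution to its nearest centroid, or open a new cluster."""
    centroids, counts, labels = [], [], []
    for w in topic_weights:
        dists = [hellinger(w, c) for c in centroids]
        best = min(range(len(dists)), key=dists.__getitem__, default=None)
        if best is None or dists[best] > threshold:
            # Nothing close enough: this tweet starts a new cluster
            centroids.append(list(w))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the matched centroid
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], w)]
            labels.append(best)
    return labels, centroids

# Hypothetical tweet->topic weights over 3 topics (illustration only)
weights = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.85, 0.10],
]
labels, centroids = cluster_stream(weights)
print(labels)  # [0, 0, 1, 2, 1]
```

Hellinger distance is used here because the inputs are probability distributions; cosine or Jensen-Shannon distance would be reasonable alternatives, and the threshold would need tuning on real data.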

    While it depends on the training set, LDA could easily bundle tweets containing stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.

    Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.

    1. "Steve Jobs was the CEO at Apple" is clearly about the company
    2. "I'm eating the most delicious apple" is clearly about the fruit
    3. "I'm going to the big apple when I travel to the USA" is most likely about visiting New York
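To make the "other words decide" idea concrete, here is a toy sketch. It is not LDA inference itself (LDA learns the topic-word table from data and infers a mixture of topics per document); it only scores each tweet against a hand-written, hypothetical table of P(word | topic) and assumes one topic per tweet with equal priors. The table values, the `UNSEEN` fallback and the `topic_weights` helper are all assumptions for illustration.

```python
# Hypothetical P(word | topic) values, hand-written for illustration only
TOPIC_WORDS = {
    "company": {"apple": 0.10, "ceo": 0.08, "steve": 0.05, "jobs": 0.05},
    "fruit": {"apple": 0.10, "eating": 0.08, "delicious": 0.06},
    "new_york": {"apple": 0.05, "big": 0.08, "travel": 0.06, "usa": 0.04},
}
UNSEEN = 0.001  # small probability for words a topic has no entry for

def topic_weights(tweet):
    """Return normalized per-topic scores for a tweet under the toy table."""
    words = tweet.lower().replace("'", " ").split()
    scores = {}
    for topic, probs in TOPIC_WORDS.items():
        score = 1.0
        for w in words:
            score *= probs.get(w, UNSEEN)
        scores[topic] = score
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

tweets = [
    "Steve Jobs was the CEO at Apple",
    "I'm eating the most delicious apple",
    "I'm going to the big apple when I travel to the USA",
]
best_topics = []
for tweet in tweets:
    weights = topic_weights(tweet)
    best_topics.append(max(weights, key=weights.get))
print(best_topics)  # ['company', 'fruit', 'new_york']
```

The word "apple" alone barely separates the topics in this table; it is "ceo", "eating" or "big ... travel ... usa" that tip each tweet one way or another.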
