What does the Brown clustering algorithm output mean?

暗喜 2020-12-25 15:05

I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. And th

5 answers

星月不相逢 (original poster)

2020-12-25 15:38
The integers are word counts: the number of times each word is seen in the document. (I have tested this with the Python implementation.)

From the comments at the top of the Python implementation:

Instead of using a window (e.g., as in Brown et al., sec. 4), this code computed PMI using the probability that two randomly selected clusters from the same document will be c1 and c2. Also, since the total numbers of cluster tokens and pairs are constant across pairs, this code use counts instead of probabilities.

From the code in the Python implementation, we see that it writes one line per word containing the word, its bit string, and its word count, separated by tabs.
```
def save_clusters(self, output_path):
    with open(output_path, 'w') as f:
        for w in self.words:
            f.write("{}\t{}\t{}\n".format(w, self.get_bitstring(w),
                                          self.word_counts[w]))
```
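To make the output format concrete, here is a minimal sketch of reading such a file back in. It assumes the `word<TAB>bitstring<TAB>count` column order written by `save_clusters` above; other implementations may order the columns differently. The helper names `read_clusters` and `group_by_prefix` and the sample data are made up for illustration and are not part of either library.

```python
import io
from collections import defaultdict


def read_clusters(lines):
    """Parse 'word<TAB>bitstring<TAB>count' lines into (word, bitstring, count) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, bitstring, count = line.split("\t")
        rows.append((word, bitstring, int(count)))
    return rows


def group_by_prefix(rows, prefix_len):
    """Group words whose bit strings share the same first prefix_len bits."""
    groups = defaultdict(list)
    for word, bitstring, _count in rows:
        groups[bitstring[:prefix_len]].append(word)
    return dict(groups)


# Made-up sample in the assumed format (not real clustering output).
sample = io.StringIO("the\t00\t120\ncat\t010\t7\ndog\t011\t5\n")
rows = read_clusters(sample)
print(rows)
print(group_by_prefix(rows, 2))
```

The third column is parsed as an integer because it is the word count. Truncating the bit string to a shorter prefix merges words into coarser clusters, since words that share a prefix sit in the same subtree of the cluster hierarchy.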