Perform Chi-2 feature selection on TF and TF*IDF vectors

Posted by 北城以北 on 2019-12-02 18:33:11

The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry (i, j) corresponds to feature i and class j, and holds the sum of the i-th feature's values across all samples belonging to class j. It then computes the χ² test statistic against expected frequencies arising from the empirical distribution over classes (just their relative frequencies in y) and a uniform distribution over feature values.
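The computation described above can be sketched in plain Python. This is only an illustration, not the actual library code: the function name `chi2_scores` and the toy data are made up, and the expected frequency is assumed to be the class's relative frequency in y multiplied by the feature's total over all samples.

```python
from collections import Counter

def chi2_scores(X, y):
    """Return one chi-squared statistic per feature (column of X).

    Hypothetical helper for illustration; X is a list of rows of
    non-negative feature values, y is a list of class labels.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))

    # Contingency table: observed[c][i] is the sum of feature i's values
    # over all samples that belong to class c.
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for i, value in enumerate(row):
            observed[label][i] += value

    # Empirical class distribution: relative frequency of each label in y.
    class_counts = Counter(y)
    class_prob = {c: class_counts[c] / n_samples for c in classes}

    # Total of each feature over all samples.
    feature_total = [sum(row[i] for row in X) for i in range(n_features)]

    scores = []
    for i in range(n_features):
        stat = 0.0
        for c in classes:
            # Assumption: expected = class probability * feature total.
            expected = class_prob[c] * feature_total[i]
            if expected > 0:
                stat += (observed[c][i] - expected) ** 2 / expected
        scores.append(stat)
    return scores

# Toy term-frequency matrix: 4 documents, 3 terms, 2 classes.
X = [[3, 0, 1],
     [2, 0, 1],
     [0, 4, 1],
     [0, 2, 1]]
y = ["a", "a", "b", "b"]

print(chi2_scores(X, y))  # [5.0, 6.0, 0.0]
```

The third term occurs equally often in both classes, so its statistic is zero, while the first two terms each occur in only one class and score high.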

This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.

It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
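One way to see why scaled frequencies behave sensibly: if a term's column is multiplied by a constant weight, both the observed and the expected sums scale by that weight, so the statistic for that term scales by the same factor. The sketch below checks this on a single term. It assumes a plain idf weight of log(N / df) + 1 with no per-document length normalization; that formula, the helper name `chi2_for_feature`, and the data are illustrative assumptions, and real tf-idf implementations may weight and normalize differently.

```python
import math
from collections import Counter

def chi2_for_feature(values, labels):
    """Chi-squared statistic for one feature column against class labels.

    Hypothetical helper for illustration only.
    """
    total = sum(values)
    n = len(labels)
    stat = 0.0
    for cls, count in Counter(labels).items():
        # Sum of the feature's values over samples in this class.
        observed = sum(v for v, l in zip(values, labels) if l == cls)
        # Assumption: expected = class relative frequency * feature total.
        expected = total * count / n
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

# Raw frequencies of one term in four documents, and the document classes.
tf = [3, 2, 0, 1]
labels = ["a", "a", "b", "b"]

# Assumed idf weight: log(N / df) + 1, where df counts documents with the term.
df = sum(1 for v in tf if v > 0)
idf = math.log(len(tf) / df) + 1.0
tfidf = [v * idf for v in tf]

raw_score = chi2_for_feature(tf, labels)
weighted_score = chi2_for_feature(tfidf, labels)

# Scaling the column by idf scales the statistic by idf as well.
print(math.isclose(weighted_score, idf * raw_score))  # True
```

Under this assumption the idf weight changes how terms rank against each other, but each term's score is still driven by how unevenly its frequency is spread across the classes.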
