Classifying Documents into Categories

伪装坚强ぢ 2020-12-22 17:59

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents

3 Answers
  •  南笙 (original poster)
    2020-12-22 18:32

    Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?

    You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
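A minimal sketch of that stratified sampling step, assuming the tagged docs are already loaded into a dict that maps each category name to its list of documents. The names `docs_by_category`, `active`, and `sample_none_of_the_above` are hypothetical and not part of NLTK; the result would simply be labeled "none of the above" and passed to whatever classifier you train on the 4 real categories.

```python
import random

def sample_none_of_the_above(docs_by_category, active, total=2000, seed=0):
    """Draw about `total` docs, spread as evenly as possible over every
    category that is NOT in `active` (the categories being trained)."""
    rng = random.Random(seed)
    others = sorted(c for c in docs_by_category if c not in active)
    if not others:
        raise ValueError("no categories left to sample from")
    # With 146 other categories and total=2000 this gives 13 docs each,
    # plus one extra doc for the first 102 categories (13-14 per category).
    base, extra = divmod(total, len(others))
    sample = []
    for i, cat in enumerate(others):
        quota = base + (1 if i < extra else 0)
        pool = docs_by_category[cat]
        # Never ask for more docs than the category actually has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

Fixing the seed only makes the draw repeatable between runs; dropping it gives a fresh random sample each time you retrain.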

    Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that separates your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the nearest cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)
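One possible reading of the cluster idea, as a rough sketch only: treat each category's average term-frequency vector as its cluster centre and assign an untagged doc to the most similar centre by cosine similarity. The choice of measure (raw word counts plus cosine), the function names, and the `min_similarity` cutoff are all assumptions made for illustration; a doc that falls below the cutoff for every category comes back as `None`, which doubles as "none of the above".

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term counts; a stand-in for a better doc measure."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_centroids(tagged_docs):
    """tagged_docs: iterable of (text, category) pairs.
    Returns {category: mean term-count vector}."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, cat in tagged_docs:
        sums[cat].update(vectorize(text))
        counts[cat] += 1
    return {cat: {t: w / counts[cat] for t, w in vec.items()}
            for cat, vec in sums.items()}

def assign(text, centroids, min_similarity=0.1):
    """Return the most similar category, or None if nothing is close enough."""
    vec = vectorize(text)
    best_cat, best_sim = None, 0.0
    for cat, centroid in centroids.items():
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat if best_sim >= min_similarity else None
```

Since only one centroid per category is kept in memory, this avoids the RAM problem of training on many categories at once, though whether 150 topic categories are separable under such a simple measure would have to be checked on a held-out slice of the 300K tagged docs.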
