TFIDF for Large Dataset

抹茶落季 2020-12-07 22:38

I have a corpus of around 8 million news articles, and I need to get their TF-IDF representation as a sparse matrix. I have been able to do that using scikit-learn f

3 Answers

[愿得一人] (original poster)

2020-12-07 23:20

Gensim has an efficient TF-IDF model that does not need the whole corpus in memory at once.

Your corpus simply needs to be an iterable, so documents can be streamed one at a time rather than loaded all at once.
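To illustrate why an iterable corpus is enough, here is a minimal sketch of the streaming idea using only the standard library. It is not Gensim's implementation: the names `StreamedCorpus`, `document_frequencies`, and `tfidf_stream` are hypothetical, the tokenizer is a plain whitespace split, and the weighting is an unnormalized `tf * log(N / df)`, which differs from the defaults in Gensim and scikit-learn.

```python
import math
from collections import Counter


class StreamedCorpus:
    """Re-iterable corpus that yields one tokenized document at a time."""

    def __init__(self, lines):
        # `lines` stands in for a file or database cursor (an assumption
        # for this sketch); any re-iterable source of strings would do.
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def document_frequencies(corpus):
    """Pass 1: count how many documents each term appears in."""
    df = Counter()
    n_docs = 0
    for tokens in corpus:
        df.update(set(tokens))  # each term counted once per document
        n_docs += 1
    return df, n_docs


def tfidf_stream(corpus, df, n_docs):
    """Pass 2: yield one sparse {term: weight} dict per document."""
    for tokens in corpus:
        tf = Counter(tokens)
        yield {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}


docs = ["the cat sat", "the dog sat", "the bird flew"]
corpus = StreamedCorpus(docs)
df, n_docs = document_frequencies(corpus)
# Collected into a list only for this tiny demo; on a large corpus,
# consume the generator one document at a time instead.
vectors = list(tfidf_stream(corpus, df, n_docs))
```

The corpus is a class with `__iter__` rather than a bare generator because it has to be traversed twice: once to collect document frequencies and once to emit the weighted vectors. Only the term statistics and the current document are held in memory at any point.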

According to the comments, the make_wiki script runs over Wikipedia in about 50 minutes on a laptop.
