TFIDF for Large Dataset

抹茶落季 2020-12-07 22:38

I have a corpus of around 8 million news articles, and I need to get their TF-IDF representation as a sparse matrix. I have been able to do that using scikit-learn f

3 answers
  •  [愿得一人]
    2020-12-07 23:20

    Gensim has an efficient tf-idf model and does not need to have everything in memory at once.

    Your corpus only needs to be an iterable, so documents can be read one at a time (for example, streamed from disk) instead of being loaded all at once.

    The make_wiki script processes Wikipedia in about 50 minutes on a laptop, according to the comments.
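
To make the streaming idea in this answer concrete, here is a minimal pure-Python sketch of two-pass TF-IDF over an iterable corpus. It is not Gensim's API or implementation. The names `StreamedCorpus`, `document_frequencies`, and `tfidf_vectors` are hypothetical, the weighting (raw term count times log(N/df), then L2 normalization) is one common variant chosen for illustration, and the tiny in-memory list stands in for a file with one article per line.

```python
import math
from collections import Counter


class StreamedCorpus:
    """Re-iterable corpus that yields one tokenized document at a time."""

    def __init__(self, lines):
        # Assumption: `lines` stands in for a file with one article per line.
        # For a real corpus, open and read the file inside __iter__ instead.
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def document_frequencies(corpus):
    """Pass 1: count how many documents contain each term."""
    df = Counter()
    n_docs = 0
    for tokens in corpus:
        df.update(set(tokens))  # each term counted once per document
        n_docs += 1
    return df, n_docs


def tfidf_vectors(corpus, df, n_docs):
    """Pass 2: lazily yield one sparse {term: weight} dict per document."""
    for tokens in corpus:
        counts = Counter(tokens)
        vec = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Terms that occur in every document get weight 0 and are dropped.
        yield {t: w / norm for t, w in vec.items() if w > 0} if norm else {}


articles = [
    "stocks fall as markets react",
    "markets rally as stocks rebound",
    "local team wins the final",
]
corpus = StreamedCorpus(articles)
df, n_docs = document_frequencies(corpus)
vectors = list(tfidf_vectors(corpus, df, n_docs))
```

Only the document-frequency table stays in memory between passes; each document is tokenized, weighted, and released in turn. In Gensim, the corresponding pieces are `corpora.Dictionary` for the vocabulary and `models.TfidfModel` for the weighting.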
