TFIDF for Large Dataset

Backend · Unresolved · 3 answers · 694 views
抹茶落季 2020-12-07 22:38

I have a corpus which has around 8 million news articles, I need to get the TFIDF representation of them as a sparse matrix. I have been able to do that using scikit-learn f

3 Answers
  •  旧巷少年郎
    2020-12-07 23:13

    I solved this problem using sklearn and pandas.

    Iterate over your dataset once using a pandas iterator and build a set of all words; then pass that set to CountVectorizer as its vocabulary. With a fixed vocabulary, CountVectorizer will produce a list of sparse matrices that all have the same number of columns, so you can simply use vstack to stack them. The resulting sparse matrix holds the same information (though the words may be in a different column order) as a CountVectorizer fitted on all your data at once.

    This solution is not the best in terms of time complexity, but it is good for memory. I have used it on a 20GB+ dataset.

    I wrote some Python code (NOT THE COMPLETE SOLUTION) that shows the idea; write a generator or use pandas chunks to iterate over your dataset.

    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.sparse import vstack
    
    
    # each string is a sample
    text_test = [
        'good people beauty wrong',
        'wrong smile people wrong',
        'idea beauty good good',
    ]
    
    # scikit-learn basic usage: fit on the whole corpus at once
    
    vectorizer = CountVectorizer()
    
    result1 = vectorizer.fit_transform(text_test)
    print(vectorizer.inverse_transform(result1))
    print(f"First approach:\n {result1}")
    
    # Alternative: build the vocabulary in a first pass, then
    # vectorize one sample at a time with that fixed vocabulary
    
    vocabulary = set()
    
    for text in text_test:
        for word in text.split():
            vocabulary.add(word)
    
    # sorting makes the column order deterministic
    vectorizer = CountVectorizer(vocabulary=sorted(vocabulary))
    
    outputs = []
    for text in text_test:  # replace with a generator or pandas chunks
        # the vocabulary is fixed, so transform (not fit_transform) suffices
        outputs.append(vectorizer.transform([text]))
    
    result2 = vstack(outputs)
    print(vectorizer.inverse_transform(result2))
    
    print(f"Second approach:\n {result2}")
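The chunked iteration the answer leaves out can be sketched with `pandas.read_csv(chunksize=...)`. The file name `corpus.csv` and the `text` column are assumptions for illustration; a tiny corpus is written to disk first so the two chunked passes are runnable:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack

# toy corpus written to disk so the chunked reads below are runnable;
# 'corpus.csv' and the 'text' column are illustrative assumptions
pd.DataFrame({'text': [
    'good people beauty wrong',
    'wrong smile people wrong',
    'idea beauty good good',
]}).to_csv('corpus.csv', index=False)

# pass 1: build the full vocabulary one chunk at a time
vocabulary = set()
for chunk in pd.read_csv('corpus.csv', chunksize=2):
    for text in chunk['text']:
        vocabulary.update(text.split())

# pass 2: vectorize each chunk with the fixed vocabulary and stack
vectorizer = CountVectorizer(vocabulary=sorted(vocabulary))
parts = [vectorizer.transform(chunk['text'])
         for chunk in pd.read_csv('corpus.csv', chunksize=2)]
counts = vstack(parts)
print(counts.shape)  # (3, 6): 3 samples, 6 distinct words
```

Only one chunk is ever held in memory at a time, which is what makes this workable for a corpus that does not fit in RAM.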
    

    Finally, apply TfidfTransformer to the stacked count matrix to get the TF-IDF representation.
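A minimal sketch of that final step, reusing the toy corpus from the answer: fit `TfidfTransformer` once on the stacked count matrix, then transform it into the TF-IDF sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.sparse import vstack

text_test = [
    'good people beauty wrong',
    'wrong smile people wrong',
    'idea beauty good good',
]

# rebuild the stacked count matrix as in the answer above
vocabulary = sorted({w for t in text_test for w in t.split()})
vectorizer = CountVectorizer(vocabulary=vocabulary)
counts = vstack([vectorizer.transform([t]) for t in text_test])

# fit learns the IDF weights over the whole count matrix;
# transform yields the L2-normalized TF-IDF sparse matrix
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # same shape as the count matrix
```

By default `TfidfTransformer` uses smoothed IDF and L2 row normalization, so each sample row of the result has unit Euclidean norm.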
