TFIDF for Large Dataset

Backend · Unresolved · 3 answers · 694 views
抹茶落季 2020-12-07 22:38

I have a corpus which has around 8 million news articles, I need to get the TFIDF representation of them as a sparse matrix. I have been able to do that using scikit-learn f

3 Answers
  •  旧巷少年郎
    2020-12-07 23:13

    I solved this problem using sklearn and pandas.

    Iterate over your dataset once using a pandas iterator and build a set of all words; then pass that set to CountVectorizer as its vocabulary. With a fixed vocabulary, CountVectorizer will produce a list of sparse matrices that all have the same number of columns, so you can simply use vstack to stack them. The resulting sparse matrix holds the same information (though the words may be in a different column order) as a CountVectorizer fitted on all your data at once.

    This solution is not the best in terms of time complexity, but it is good for memory. I have used it on a 20GB+ dataset.

    I wrote some Python code (NOT THE COMPLETE SOLUTION) that shows the idea; write a generator or use pandas chunks to iterate over your dataset.

    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.sparse import vstack
    
    
    # each string is a sample
    text_test = [
        'good people beauty wrong',
        'wrong smile people wrong',
        'idea beauty good good',
    ]
    
    # scikit-learn basic usage: fit on the whole corpus at once
    
    vectorizer = CountVectorizer()
    
    result1 = vectorizer.fit_transform(text_test)
    print(vectorizer.inverse_transform(result1))
    print(f"First approach:\n {result1}")
    
    # Alternative: build the vocabulary in a first pass, then
    # vectorize one sample at a time with that fixed vocabulary
    
    vocabulary = set()
    
    for text in text_test:
        for word in text.split():
            vocabulary.add(word)
    
    # sorting makes the column order deterministic
    vectorizer = CountVectorizer(vocabulary=sorted(vocabulary))
    
    outputs = []
    for text in text_test:  # replace with a generator or pandas chunks
        # the vocabulary is fixed, so transform (not fit_transform) suffices
        outputs.append(vectorizer.transform([text]))
    
    result2 = vstack(outputs)
    print(vectorizer.inverse_transform(result2))
    
    print(f"Second approach:\n {result2}")
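The chunked iteration the answer leaves out can be sketched with `pandas.read_csv(chunksize=...)`. The file name `corpus.csv` and the `text` column are assumptions for illustration; a tiny corpus is written to disk first so the two chunked passes are runnable:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack

# toy corpus written to disk so the chunked reads below are runnable;
# 'corpus.csv' and the 'text' column are illustrative assumptions
pd.DataFrame({'text': [
    'good people beauty wrong',
    'wrong smile people wrong',
    'idea beauty good good',
]}).to_csv('corpus.csv', index=False)

# pass 1: build the full vocabulary one chunk at a time
vocabulary = set()
for chunk in pd.read_csv('corpus.csv', chunksize=2):
    for text in chunk['text']:
        vocabulary.update(text.split())

# pass 2: vectorize each chunk with the fixed vocabulary and stack
vectorizer = CountVectorizer(vocabulary=sorted(vocabulary))
parts = [vectorizer.transform(chunk['text'])
         for chunk in pd.read_csv('corpus.csv', chunksize=2)]
counts = vstack(parts)
print(counts.shape)  # (3, 6): 3 samples, 6 distinct words
```

Only one chunk is ever held in memory at a time, which is what makes this workable for a corpus that does not fit in RAM.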
    

    Finally, apply TfidfTransformer to the stacked count matrix to get the TF-IDF representation.
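A minimal sketch of that final step, reusing the toy corpus from the answer: fit `TfidfTransformer` once on the stacked count matrix, then transform it into the TF-IDF sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.sparse import vstack

text_test = [
    'good people beauty wrong',
    'wrong smile people wrong',
    'idea beauty good good',
]

# rebuild the stacked count matrix as in the answer above
vocabulary = sorted({w for t in text_test for w in t.split()})
vectorizer = CountVectorizer(vocabulary=vocabulary)
counts = vstack([vectorizer.transform([t]) for t in text_test])

# fit learns the IDF weights over the whole count matrix;
# transform yields the L2-normalized TF-IDF sparse matrix
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # same shape as the count matrix
```

By default `TfidfTransformer` uses smoothed IDF and L2 row normalization, so each sample row of the result has unit Euclidean norm.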
