I want to calculate tf-idf from the documents below. I'm using Python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})
I found a slightly different method using CountVectorizer from sklearn (credit for the count vectorizer/word frequency approach goes to Ultraviolet Analysis, and for preprocessing/cleaning text to Usman Malik's tweet-scraping tutorial). I won't be covering preprocessing in this answer. Basically, what you want to do is import CountVectorizer and fit your data to the CountVectorizer object, which will let you access the .vocabulary_.items() attribute. That gives you the vocabulary of your dataset: a mapping of each unique term to its feature index, subject to any limiting parameters you pass into CountVectorizer, such as max_features.
Then you're going to use TfidfTransformer to generate tf-idf weights for the terms in a similar manner.
I am coding in a Jupyter notebook using pandas and the PyCharm IDE.
Here is a code snippet:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features=5000, stop_words='english', min_df=0.01, max_df=0.90)
#%%
#use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
#the raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
#useful debug, get an idea of the item list you generated
list(countVec.vocabulary_.items())
#%%
#convert to bag of words
#transform() returns a sparse document-term matrix: one row per document, one column per vocabulary term
countVec_count = countVec.transform(tweetsFrameWords["Text"])
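#countVec_count is a scipy.sparse matrix that stores only the nonzero counts;
#a quick sanity check using standard scipy.sparse attributes:
print(countVec_count.shape)  #(number of documents, number of vocabulary terms)
print(countVec_count.nnz)    #number of nonzero entries actually stored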
#%%
#make array from number of occurrences
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
#make a new data frame with columns 'term' and 'occurrences', i.e. each word and its number of occurrences
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)
#sort by number of word occurrences, most->least; if you leave off the ascending flag, sort_values defaults to ascending order
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)
#%%
#now, convert to a more useful ranking system, tf-idf weights
#TfidfTransformer: scale raw word counts to tf-idf weights, down-weighting terms that appear in many documents
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()
#fit the transformer on the raw counts and transform them into tf-idf weights
tweetWeights = tweetTransformer.fit_transform(countVec_count)
#follow a similar process to the occurrences data frame, but take each term's mean tf-idf weight across documents
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
#now that we've computed tf-idf, make a dataframe with weights and terms
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
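If you want the raw counts and the tf-idf weights side by side, the two data frames share the 'term' column, so a plain pandas merge lines them up; a minimal sketch, assuming the bowListFrame and tweetWeightFrame objects built above:
#join raw counts and tf-idf weights on the shared 'term' column
combined = bowListFrame.merge(tweetWeightFrame, on='term')
combined.sort_values(by='weight', ascending=False).head(20)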
Scikit-learn's implementation is really easy:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify. See the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
The output of fit_transform will be a sparse matrix; if you want to visualize it you can call x.toarray():
In [44]: x.toarray()
Out[44]:
array([[ 0.64612892, 0.38161415, 0. , 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0.64612892, 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0. , 0.38161415, 0.38161415,
0.64612892, 0.38161415]])
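To see which column of this matrix corresponds to which term, you can label the dense array with the vectorizer's vocabulary; a minimal sketch, assuming the v and x objects from above:
import pandas as pd
#one row per document, one column per vocabulary term
weights = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(weights)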
A simple solution is to use texthero:
import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])
In [5]: df.head()
Out[5]:
docId sent tfidf
0 1 This is the first sentence [0.3816141458138271, 0.6461289150464732, 0.381...
1 2 This is the second sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
2 3 This is the third sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
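Since hero.tfidf stores each document's weights as a list inside a single column, you can expand that column into a document-by-term matrix with plain pandas; a minimal sketch, assuming the df from above (note the resulting columns are positional indices, not term labels):
#expand each per-document weight list into its own row of columns
tfidf_matrix = pd.DataFrame(df['tfidf'].tolist(), index=df['docId'])
print(tfidf_matrix)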