Counting the Frequency of words in a pandas data frame

后端未结


I have a table like below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986


                      
3 Answers

深忆病人  2020-12-04 18:28

First lowercase the values with str.lower and concatenate them into a single string with str.cat, then tokenize that string with word_tokenize, and finally apply your own solution:

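import nltk
import pandas as pd

top_N = 4
# Lowercase first so that 'Society' and 'society' count as the same word;
# drop .str.lower() if case matters.
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
# word_tokenize needs the NLTK tokenizer data to be installed.
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)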
<FreqDist with 17 samples and 20 outcomes>

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
      Word  Frequency
0  society          3
1      ltd          2
2      the          1
3       co          1


You can also omit the lower() call if the original case should be preserved:

top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
         Word  Frequency
0     Society          3
1         Ltd          2
2         MMV          1
3  Kensington          1
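
If NLTK is not available, the same idea can be sketched with only pandas and the standard library. This is a minimal sketch and not the answer's own method: it swaps word_tokenize for a plain whitespace split and FreqDist for collections.Counter, and it assumes the five-row sample frame shown in the other answers on this page. A whitespace split keeps punctuation attached to words (for example "james's" stays one token), so the counts can differ from what NLTK produces.

```python
from collections import Counter

import pandas as pd

# Sample data assumed for illustration (the five rows shown elsewhere on this page).
data = pd.DataFrame(
    {
        "URN": [104472, 104873, 109986, 114058, 113438],
        "Firm_Name": [
            "R.X. Yah & Co",
            "Big Building Society",
            "St James's Society",
            "The Kensington Society Ltd",
            "MMV Oil Associates Ltd",
        ],
    }
)

top_n = 2

# Lowercase every name, then join all rows into one space-separated string.
joined = data["Firm_Name"].str.lower().str.cat(sep=" ")

# A whitespace split stands in for nltk.tokenize.word_tokenize here.
word_counts = Counter(joined.split())

# most_common returns (word, count) pairs, which map directly onto two columns.
rslt = pd.DataFrame(word_counts.most_common(top_n), columns=["Word", "Frequency"])
print(rslt)
```

Only the two most frequent words are requested here because every other word in the sample occurs once, so the order among them is not meaningful.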

                                                                        
                                                        
            
            
              
                
日久生厌  2020-12-04 18:29

IIUC, split each name into words and use value_counts():

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64




Or,

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()




Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()




For the top N words, for example N = 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64
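
The one-liners above can be wrapped in a small helper. The following is only a sketch: the function name top_words and its parameters are hypothetical, and it uses Series.explode (available in pandas 0.25 and later) in place of split(expand=True).stack(). It also drops missing names first, on the assumption that a real Firm_Name column may contain NaN; the extra row with a missing name is made up to show that case.

```python
import pandas as pd


def top_words(names: pd.Series, n: int = 3, lowercase: bool = False) -> pd.Series:
    """Return the n most frequent whitespace-separated words in a string Series.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    cleaned = names.dropna().astype(str)
    if lowercase:
        cleaned = cleaned.str.lower()
    # split() yields one list of words per row; explode() gives each word its own row.
    words = cleaned.str.split().explode()
    return words.value_counts().head(n)


df = pd.DataFrame(
    {
        # The last row is a made-up record with a missing name.
        "URN": [104472, 104873, 109986, 114058, 113438, 999999],
        "Firm_Name": [
            "R.X. Yah & Co",
            "Big Building Society",
            "St James's Society",
            "The Kensington Society Ltd",
            "MMV Oil Associates Ltd",
            None,
        ],
    }
)

print(top_words(df["Firm_Name"], n=2))
```

Using head(n) instead of slicing with [:n] gives the same rows here; it is simply a more explicit way to take the first n entries of the sorted counts.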




Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

                                                                        
                                                        
            
            
              
                
时光说笑  2020-12-04 18:47

The approach from the answer to "Count distinct words from a Pandas Data Frame" also works here. It uses collections.Counter and updates the counter with the words of each row.

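from collections import Counter

import pandas as pd

c = Counter()
df = pd.DataFrame(
    [[104472,"R.X. Yah & Co"],
    [104873,"Big Building Society"],
    [109986,"St James's Society"],
    [114058,"The Kensington Society Ltd"],
    [113438,"MMV Oil Associates Ltd"]
], columns=["URN","Firm_Name"])
# Split each name into a list of words and pass each list to c.update;
# the counts accumulate in c (the Series returned by apply is all None).
df.Firm_Name.str.split().apply(c.update)
c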

Counter({'R.X.': 1,
         'Yah': 1,
         '&': 1,
         'Co': 1,
         'Big': 1,
         'Building': 1,
         'Society': 3,
         'St': 1,
         "James's": 1,
         'The': 1,
         'Kensington': 1,
         'Ltd': 2,
         'MMV': 1,
         'Oil': 1,
         'Associates': 1})
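
To get from the Counter to a Word/Frequency table like the one in the first answer, the output of most_common can be passed straight to a DataFrame. A minimal sketch, assuming the same five sample rows; it builds the Counter with itertools.chain rather than apply(c.update), which is a stylistic swap and not what the answer above does.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame(
    [
        [104472, "R.X. Yah & Co"],
        [104873, "Big Building Society"],
        [109986, "St James's Society"],
        [114058, "The Kensington Society Ltd"],
        [113438, "MMV Oil Associates Ltd"],
    ],
    columns=["URN", "Firm_Name"],
)

# Flatten the per-row word lists into one stream of words and count it.
c = Counter(chain.from_iterable(df["Firm_Name"].str.split()))

top_n = 2
# most_common returns (word, count) pairs sorted by descending count.
rslt = pd.DataFrame(c.most_common(top_n), columns=["Word", "Frequency"])
print(rslt)
```

Building the Counter in one expression avoids relying on apply for its side effect, which some readers find easier to follow.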

                                                                        
                                                        
            
            
              
                